Amazon Apologizes, Explains Last Week's Cloud Outage in Dublin -- Redmondmag.com

By Jeffrey Schwartz
08/18/2011

Amazon Web Services apologized and issued a detailed post-mortem explaining the cause of a massive service outage that crippled its Dublin data center last week.

Although a lightning strike was originally thought to be the cause, Amazon said it is not clear what caused the failure of a transformer that led to a power outage in the datacenter on Aug. 7. In any case, the subsequent malfunction of a programmable logic controller (PLC), which is designed to ensure synchronization between generators, caused the cutover to the backup generators to fail.

It all went downhill from there. With utility power lost and the backup generators disabled, there wasn't enough power for all the servers in the Availability Zone to continue operating, Amazon said. The uninterruptible power supplies (UPSs) also quickly drained, resulting in power loss to most of the EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in the Availability Zone.

Power was also lost to the EC2 networking gear that connects the Availability Zone to the Internet and to other Amazon Availability Zones. That resulted in further connectivity issues that led to errors when customers targeted API requests to the impacted Availability Zone.

Ultimately, Amazon was able to bring some of the backup generators online manually, which restored power to many of the EC2 instances and EBS volumes, but it took longer to restore power to the networking devices.

Restoration of EBS took longer due to the atypically large number of EBS volumes that lost power. There wasn't enough spare capacity to support re-mirroring, Amazon said. That required Amazon to truck in more servers, which was a logistical problem because it was nighttime.

Another problem: When EC2 instances and all nodes containing EBS volume replicas concurrently lost power, Amazon said it couldn't verify that all of the writes to all of the nodes were "completely consistent." Amazon therefore assumed those volumes were in an inconsistent state, even though they may actually have been consistent.

"Bringing a volume back in an inconsistent state without the customer being aware could cause undetectable, latent data corruption issues which could trigger a serious impact later," Amazon said. "For the volumes we assumed were inconsistent, we produced a recovery snapshot to enable customers to create a new volume and check its consistency before trying to use it. The process of producing recovery snapshots was time-consuming because we had to first copy all of the data from each node to Amazon Simple Storage Service (Amazon S3), process that data to turn it into the snapshot storage format, and re-copy the data to make it accessible from a customer's account. Many of the volumes contained a lot of data (EBS volumes can hold as much as 1 TB per volume)."
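The conservative rule described above can be sketched in a few lines of Python. This is only an illustration of the reasoning in the article, not Amazon's implementation: the `Replica` class, its fields, and the `needs_recovery_snapshot` function are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical stand-in for one node holding a copy of an EBS volume."""
    node_id: str
    lost_power: bool


def needs_recovery_snapshot(instance_lost_power: bool, replicas: List[Replica]) -> bool:
    """Return True when a volume should be treated as possibly inconsistent.

    Assumed rule, paraphrasing the article: if the EC2 instance and every node
    holding a replica lost power at the same time, no surviving copy can
    confirm that all writes landed, so the volume is assumed inconsistent even
    though it may in fact be fine.
    """
    return instance_lost_power and all(r.lost_power for r in replicas)


# One replica stayed up, so the volume can be verified against it.
verifiable = needs_recovery_snapshot(
    True, [Replica("node-a", True), Replica("node-b", False)]
)

# Instance and all replicas lost power together: assume inconsistent.
suspect = needs_recovery_snapshot(
    True, [Replica("node-a", True), Replica("node-b", True)]
)
```

Under this sketch, only the second volume would be routed to the slower recovery-snapshot path, which is why the number of simultaneous power losses mattered so much to recovery time.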

It took until Aug. 10 to have 98 percent of the recovery snapshots available, Amazon said, with the remaining ones requiring manual intervention. The power outage also had a significant impact on Amazon's Relational Database Service (RDS).

Furthermore, Amazon engineers discovered a bug in the EBS software, unrelated to the power outage, that affected the cleanup of snapshots.

So what is Amazon going to do to prevent a repeat of last week's events?

For one, the company is planning to add redundancy and greater isolation of its PLCs "so they are insulated from other failures." Amazon said it is working with its vendors to deploy isolated backup PLCs. "We will deploy this as rapidly as possible," the company said.

Amazon also said it will implement better load balancing to take failed API management hosts out of production. And for EBS, the company said it will "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes" during a major disruption.

During Amazon's last major outage in late April, the company received a lot of heat for not providing better communications. "Based on prior customer feedback, we communicated more frequently during this event on our Service Health Dashboard than we had in other prior events, we had evangelists tweet links to key early dashboard updates, we staffed up our AWS support team to handle much higher forum and premium support contacts, and we tried to give an approximate time-frame early on for when the people with extra-long delays could expect to start seeing recovery," the company said.

For those awaiting recovery of snapshots, Amazon said it did not know how long the process would take "or we would have shared it." To improve communications, Amazon indicated it will expedite the staffing of the support team in the early hours of an event and will aim to make it easier for customers and Amazon to determine if their resources have been impacted.

Amazon said it will issue affected customers a 10-day credit equal to 100 percent of their usage of Elastic Block Store volumes, EC2 instances and RDS database instances that were running in the affected Availability Zone in the Dublin datacenter.

Moreover, customers impacted by the EBS software bug that deleted blocks in their snapshots will receive a 30-day credit for 100 percent of their EBS usage in the Dublin region. Those customers will also have access to the company's Premium Support Engineers if they still require help recovering from the outage, Amazon said.

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.
