Amazon Apologizes, Explains Last Week's Cloud Outage in Dublin
Amazon Web Services apologized and issued a detailed post-mortem explaining the cause of a massive service outage that crippled its Dublin data center last week.
Although a lightning strike was originally thought to be the cause, Amazon said it is not clear what caused the failure of a transformer that led to a power outage in the data center on Aug. 7. In any case, the subsequent malfunction of a programmable logic controller (PLC), which is designed to ensure synchronization between generators, caused the cutover to a backup generator to fail.
It all went downhill from there. With utility power lost and the backup generators disabled, there wasn't enough power for all the servers in the Availability Zone to continue operating, Amazon said. The uninterruptible power supplies (UPSs) also quickly drained, resulting in power loss to most of the EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in the Availability Zone.
Power was also lost to the EC2 networking gear that connects the Availability Zone to the Internet and to other Amazon Availability Zones. That resulted in further connectivity issues that led to errors when customers targeted API requests to the impacted Availability Zone.
Ultimately, Amazon was able to bring some of the backup generators online manually, which restored power to many of the EC2 instances and EBS volumes, but it took longer to restore power to the networking devices.
Restoration of EBS took longer due to the atypically large number of EBS volumes that lost power. There wasn't enough spare capacity to support re-mirroring, Amazon said. That required Amazon to truck in more servers, which was a logistical problem because it was nighttime.
Another problem: When EC2 instances and all nodes containing EBS volume replicas concurrently lost power, Amazon said it couldn't verify that all of the writes to all of the nodes were "completely consistent." That being the case, the assumption was that the volume was in an inconsistent state, even though the volumes may have actually been consistent.
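The conservative rule described above can be sketched in a few lines of Python. This is an illustration only, not Amazon's implementation: the `Replica` type, its `lost_power` flag, the `classify_volume` function and the two status strings are hypothetical names chosen for the example, and the "verifiable" outcome for volumes with a surviving replica is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    node: str          # hypothetical node identifier
    lost_power: bool   # True if this node lost power during the event


def classify_volume(instance_lost_power: bool, replicas: List[Replica]) -> str:
    """Apply the conservative rule: if the instance and every replica node
    lost power at the same time, the writes cannot be verified, so the
    volume is treated as inconsistent even though it may be fine."""
    if not replicas:
        raise ValueError("a volume needs at least one replica")
    all_replicas_lost_power = all(r.lost_power for r in replicas)
    if instance_lost_power and all_replicas_lost_power:
        return "assumed_inconsistent"
    # Assumption for this sketch: a surviving node allows verification.
    return "verifiable"
```

For example, `classify_volume(True, [Replica("node-a", True), Replica("node-b", True)])` falls into the assumed-inconsistent case, which is the group of volumes that Amazon handled with recovery snapshots.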
"Bringing a volume back in an inconsistent state without the customer being aware could cause undetectable, latent data corruption issues which could trigger a serious impact later," Amazon said. "For the volumes we assumed were inconsistent, we produced a recovery snapshot to enable customers to create a new volume and check its consistency before trying to use it. The process of producing recovery snapshots was time-consuming because we had to first copy all of the data from each node to Amazon Simple Storage Service (Amazon S3), process that data to turn it into the snapshot storage format, and re-copy the data to make it accessible from a customer's account. Many of the volumes contained a lot of data (EBS volumes can hold as much as 1 TB per volume)."
It took until Aug. 10 to have 98 percent of the recovery snapshots available, Amazon said, with the remaining ones requiring manual intervention. The power outage also had a significant impact on Amazon's Relational Database Service (RDS).
Furthermore, Amazon engineers discovered a bug in the EBS software, unrelated to the power outage, that affected the cleanup of snapshots.
So what is Amazon going to do to prevent a repeat of last week's events?
For one, the company is planning to add redundancy and greater isolation of its PLCs "so they are insulated from other failures." Amazon said it is working with its vendors to deploy isolated backup PLCs. "We will deploy this as rapidly as possible," the company said.
Amazon also said it will implement better load balancing to take failed API management hosts out of production. And for EBS, the company said it will "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes" during a major disruption.
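Amazon did not describe how the improved load balancing will work. As a rough sketch of the general idea of taking failed hosts out of rotation, the Python below filters a host pool through a health check before distributing requests round-robin. The host names, the `is_healthy` callback and both function names are hypothetical and do not reflect any AWS API.

```python
from itertools import cycle
from typing import Callable, Iterable, List


def healthy_hosts(hosts: Iterable[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Keep only the hosts that pass the caller-supplied health check."""
    return [h for h in hosts if is_healthy(h)]


def route_requests(hosts: Iterable[str],
                   is_healthy: Callable[[str], bool],
                   n_requests: int) -> List[str]:
    """Assign n_requests to healthy hosts in round-robin order."""
    pool = healthy_hosts(hosts, is_healthy)
    if not pool:
        raise RuntimeError("no healthy hosts available")
    rotation = cycle(pool)
    return [next(rotation) for _ in range(n_requests)]
```

With three hosts and a health check that fails the second one, `route_requests(["api-1", "api-2", "api-3"], lambda h: h != "api-2", 4)` sends requests only to `api-1` and `api-3`.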
During Amazon's last major outage in late April, the company received a lot of heat for not providing better communications. "Based on prior customer feedback, we communicated more frequently during this event on our Service Health Dashboard than we had in other prior events, we had evangelists tweet links to key early dashboard updates, we staffed up our AWS support team to handle much higher forum and premium support contacts, and we tried to give an approximate time-frame early on for when the people with extra-long delays could expect to start seeing recovery," the company said.
For those awaiting recovery of snapshots, Amazon said it did not know how long the process would take "or we would have shared it." To improve communications, Amazon indicated it will expedite the staffing of the support team in the early hours of an event and will aim to make it easier for customers and Amazon to determine if their resources have been impacted.
Amazon said it will issue affected customers a 10-day credit equal to 100 percent of their usage of Elastic Block Store volumes, EC2 instances and RDS database instances that were running in the affected Availability Zone in the Dublin data center.
Moreover, customers impacted by the EBS software bug that deleted blocks in their snapshots will receive a 30-day credit for 100 percent of their EBS usage in the Dublin region. Those customers will also have access to the company's Premium Support Engineers if they still require help recovering from the outage, Amazon said.
Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.