Amazon Speaks on Cloud Outage, Offers Apology and Credits -- Redmondmag.com

By Jeffrey Schwartz
04/29/2011

Amazon Web Services on Friday issued an apology for an outage that left certain customer sites down for days, causing permanent data loss in some cases.

The apology was accompanied by an explanation of the cause of the massive failure that occurred earlier this month. Until today, the company had largely been silent on the matter, apart from posting notes on its Service Health Dashboard.

"We want to apologize," the company stated in a post-mortem report. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services."

The problem began on Thursday, April 21, when the company was performing a routine network upgrade to an "Availability Zone," or hub, at its Northern Virginia datacenter in an attempt to increase capacity. The upgrade was executed incorrectly.

"During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen," the company explained. "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower-capacity redundant EBS network.

"For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another."

The company plans to take steps to make sure such an event doesn't recur.

"We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future."

Customers that were affected by the outage will automatically receive 10-day credits equal to 100 percent of their usage of EBS volumes, Elastic Compute Cloud (EC2) instances and Relational Database Service (RDS) database instances that were running in the affected Availability Zone, Amazon said. Affected customers will likely welcome the credits, but in some cases the credits may not cover the business lost to the outage.
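As a rough illustration of the credit arithmetic, the sketch below assumes the credit amounts to 10 days of a customer's usage charges at a 100 percent rate. That reading, the function name `outage_credit` and the daily charge used are assumptions for illustration, not details confirmed by Amazon.

```python
def outage_credit(daily_usage_charge: float,
                  credit_days: int = 10,
                  rate: float = 1.0) -> float:
    """Estimate a credit as `credit_days` of usage charges at the given rate.

    Assumption: the 10-day, 100 percent credit is read as ten days of
    charges for the affected resources. Amazon's exact method may differ.
    """
    return round(daily_usage_charge * credit_days * rate, 2)


# Hypothetical customer spending $12.50 per day on affected resources.
example = outage_credit(12.50)
print(example)  # 125.0
```

A customer whose lost business exceeds that figure would not be made whole by the credit alone.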

Amazon also promised to improve its communications in the future.

"We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what's going on, how long it will take to fix, and what we are doing so that it doesn't happen again."

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.
