Amazon's Big Mistake -- Redmondmag.com

The Schwartz Cloud Report

Amazon's Big Mistake

[UPDATE: On Friday, Amazon released a detailed report explaining the cause of the outage.]

Amazon Web Services' four-day outage was a defining moment in the history of cloud computing -- not only for its impact but for the company's deafening silence.

The widely reported outage at Amazon's Northern Virginia datacenter left a number of sites crippled for several days, though Amazon has since reported that service has been restored. However, the company has acknowledged that 0.07 percent of the Elastic Block Store (EBS) volumes apparently won't be fully recoverable.

"Every day, inside companies all over the world, there are technology outages," Rackspace Chief Strategy Officer Lew Moorman told The New York Times. "Each episode is smaller, but they add up to far more lost time, money and business."

As for the Amazon outage, he added: "We all have an interest in Amazon handling this well." Did Amazon handle this well? Let's presume the company did everything in its power to remedy the problem and get its customers back online. Amazon has promised to issue a post-mortem once it gets everyone restored and figures out what went wrong.

But the company went dark from a communications perspective. Sure, it posted periodic updates on its Service Health Dashboard, but the company issued no other public statements on the situation as it was unfolding (though it was in direct communication with affected customers). Considering how visible Amazon technologists are on social media, including Twitter, a mere reference to the dashboard felt shallow.

"Most customers are saying today they have not been very transparent and open about what has exactly happened," Forrester analyst Vanessa Alverez told Bloomberg TV. "Their public relations to date has not been up to par."

Consider the communiqué of one of Amazon's customers affected by the outage. In a blog post called "Making it Right..." HootSuite explained to customers what happened and how it was going to make good on the downtime it experienced. Although its terms of service require reimbursement after a 24-hour outage and it was down for only 15 hours, HootSuite said it would offer credits.

"We acknowledge users were inconvenienced and we want to make things right," the company said. "We are taking steps to increase redundancy of our services and data across multiple geographic regions. This was a bit of a unique outage which is highly unlikely to occur again, but we'll be even more prepared for future emergencies."

Neither during the outage nor as of this writing, a week after it first hit, has any such communication come from Amazon. PundIT analyst Charles King said in a research note that datacenter failures, even major ones, are inevitable, but communication is critical. He wrote:

"The fact that disaster is inevitable is why good communications skills are so crucial for any company to develop, and why Amazon's anemic public response to the outage made a bad situation far worse than it needed to be. Yes, the company maintained a site that regularly updated how repairs were progressing, and, to its credit, Amazon says it will publish a full analysis of the outage after its investigation is complete.

"But while the company has been among the industry's most vocal cloud services cheerleaders, it seemed essentially tone deaf to the damage its inaction was doing to public perception of cloud computing. At the end of the day, we expect Amazon will use the lessons learned from the EC2 outage to significantly improve its service offerings. But if it fails to closely evaluate communications efforts around the event, the company's and its customers' suffering will be wasted."

I remember when, during the dotcom boom over a decade ago, companies like Charles Schwab, E-Trade and eBay had highly visible outages that affected many thousands of customers. They took big PR hits for their lack of availability, but their Web businesses prospered nonetheless.

While Amazon's outage will elevate the discussion of resiliency and redundancy (those discussions were already happening), it seems highly unlikely that it will alter the move to cloud computing, even if it serves as a historic speed bump. "We shouldn't let Amazon off the hook and should expect a very thorough postmortem. But in no way does this change the landscape for the age-old public-private debate," writes analyst Ben Kepes.

While Amazon's outage was a black eye for cloud computing, providers of all sizes, including Amazon, will undoubtedly learn from the mistakes that were made, both technical and procedural. Hopefully, that will include better communications moving forward.

Posted by Jeffrey Schwartz on 04/28/2011
