The Schwartz Report

Blog archive

Breakdown in Procedures Caused Azure Outage

When Microsoft's Windows Azure cloud storage service went down worldwide late last month, the company confirmed within a few hours the cause of the massive meltdown.  An expired SSL certificate crippled the service late Friday Feb. 22 into the next day.

Furious customers wanted to know how something as simple as renewing a SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world?  It turns out the cause was a "breakdown in procedures," according to Mike Neil, general manager for Microsoft's Windows Azure service, in a contrite post-mortem posted Friday detailing the cause of the outage and plans to ensure the error isn't repeated.

 "A breakdown in our procedures for maintaining and monitoring these certificates was the root cause," Neil noted. "Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system."

Neil explained that Windows Azure has an internal service called the Secret Store that manages hundreds of certificates used to securely run the cloud service. The Secret Store alerted the team on Jan. 7 that the blob, queue and table certificates would expire on Feb. 22. It turns out the storage team did update the certificates but failed to flag a storage service release as one that included the updated certs.

"Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline," Neil explained. "Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system." Hence, that's how it fell through the cracks.

So what's Microsoft doing to ensure this doesn't happen again? Neil said the Windows Azure team will improve the process for detecting certificates that need to be renewed. Production certs due to expire in less than three months will generate and operational incident and will be tracked as what he termed a Service Impacting Event.

"We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly," he noted. "In the interim, all manual processes involving certificates have been reviewed with the teams. We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure."

Posted by Jeffrey Schwartz on 03/04/2013 at 1:15 PM


Featured

  • Gears

    Top 10 Microsoft Tips and Analyses of 2018

    Here are the year's most popular explainers and how-to columns -- along with some plain, old "Why did Microsoft do that?" musings thrown in.

  • Sign

    2018 Microsoft Predictions Revisited

    From guessing the fate of Windows 10 S to predicting Microsoft's next big move with Linux, Brien's predictions from a year ago were on the mark more than they weren't.

  • Microsoft Recaps Delivery Optimization Bandwidth Controls for Organizations

    Microsoft expects organizations using its Delivery Optimization peer-to-peer update scheme will optimally see 60 percent to 70 percent improvements in terms of network bandwidth use.

  • Getting a Handle on Hyper-V Virtual NICs

    Hyper-V usually makes it easy to configure virtual network adapters within VMs. That is, until you need to create a VM containing multiple virtual NICs.

comments powered by Disqus
Most   Popular

Office 365 Watch

Sign up for our newsletter.

Terms and Privacy Policy consent

I agree to this site's Privacy Policy.