The Schwartz Report

Blog archive

Breakdown in Procedures Caused Azure Outage

When Microsoft's Windows Azure cloud storage service went down worldwide late last month, the company confirmed within a few hours the cause of the massive meltdown.  An expired SSL certificate crippled the service late Friday Feb. 22 into the next day.

Furious customers wanted to know how something as simple as renewing a SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world?  It turns out the cause was a "breakdown in procedures," according to Mike Neil, general manager for Microsoft's Windows Azure service, in a contrite post-mortem posted Friday detailing the cause of the outage and plans to ensure the error isn't repeated.

 "A breakdown in our procedures for maintaining and monitoring these certificates was the root cause," Neil noted. "Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system."

Neil explained that Windows Azure has an internal service called the Secret Store that manages hundreds of certificates used to securely run the cloud service. The Secret Store alerted the team on Jan. 7 that the blob, queue and table certificates would expire on Feb. 22. It turns out the storage team did update the certificates but failed to flag a storage service release as one that included the updated certs.

"Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline," Neil explained. "Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system." Hence, that's how it fell through the cracks.

So what's Microsoft doing to ensure this doesn't happen again? Neil said the Windows Azure team will improve the process for detecting certificates that need to be renewed. Production certs due to expire in less than three months will generate and operational incident and will be tracked as what he termed a Service Impacting Event.

"We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly," he noted. "In the interim, all manual processes involving certificates have been reviewed with the teams. We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure."

Posted by Jeffrey Schwartz on 03/04/2013 at 1:15 PM


Featured

  • Spaceflight Training in the Middle of a Pandemic

    Surprisingly, the worldwide COVID-19 lockdown has hardly slowed down the space training process for Brien. In fact, it has accelerated it.

  • Surface and ARM: Why Microsoft Shouldn't Follow Apple's Lead and Dump Intel

    Microsoft's current Surface flagship, the Surface Pro X, already runs on ARM. But as the ill-fated Surface RT showed, going all-in on ARM never did Microsoft many favors.

  • IT Security Isn't Supposed To Be Easy

    Joey explains why it's worth it to endure a little inconvenience for the long-term benefits of a password manager and multifactor authentication.

  • Microsoft Makes It Easier To Self-Provision PCs via Windows Autopilot When VPNs Are Used

    Microsoft announced this week that the Windows Autopilot service used with Microsoft Intune now supports enrolling devices, even in cases where virtual private networks (VPNs) might get in the way.

comments powered by Disqus

Office 365 Watch

Sign up for our newsletter.

Terms and Privacy Policy consent

I agree to this site's Privacy Policy.