The Schwartz Report

Breakdown in Procedures Caused Azure Outage

When Microsoft's Windows Azure cloud storage service went down worldwide late last month, the company confirmed the cause of the massive meltdown within a few hours: an expired SSL certificate crippled the service from late Friday, Feb. 22, into the next day.

Furious customers wanted to know how something as simple as renewing an SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world? It turns out the cause was a "breakdown in procedures," according to Mike Neil, general manager for Microsoft's Windows Azure service, in a contrite post-mortem posted Friday detailing the cause of the outage and plans to ensure the error isn't repeated.

"A breakdown in our procedures for maintaining and monitoring these certificates was the root cause," Neil noted. "Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system."

Neil explained that Windows Azure has an internal service called the Secret Store that manages hundreds of certificates used to securely run the cloud service. The Secret Store alerted the team on Jan. 7 that the blob, queue and table certificates would expire on Feb. 22. The storage team did update the certificates in the Secret Store but failed to flag the storage service release as one that included the updated certs.

"Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline," Neil explained. "Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system." That is how the renewal fell through the cracks.

So what's Microsoft doing to ensure this doesn't happen again? Neil said the Windows Azure team will improve the process for detecting certificates that need to be renewed. Production certs due to expire in less than three months will generate an operational incident and be tracked as what he termed a Service Impacting Event.
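
To make the idea concrete, here is a minimal Python sketch of such a check. It is an illustration only, not Microsoft's implementation: the `CertStatus` record, the `open_incidents` function, the 90-day stand-in for "three months" and the sample dates are all assumptions. The point it tries to capture is the alerting gap Neil described: the check looks at the certificate actually running in production, so a cert that was renewed in the store but never deployed still raises an incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: "less than three months" is approximated as 90 days.
RENEWAL_WINDOW = timedelta(days=90)


@dataclass
class CertStatus:
    """Hypothetical record for one production certificate."""
    service: str                               # e.g. "blob", "queue", "table"
    deployed_expiry: datetime                  # expiry of the cert running in production
    stored_expiry: Optional[datetime] = None   # expiry of the cert held in the store, if known


def open_incidents(certs: List[CertStatus], now: datetime,
                   window: timedelta = RENEWAL_WINDOW) -> List[str]:
    """Return one incident message per deployed cert expiring within `window`."""
    incidents = []
    for cert in certs:
        remaining = cert.deployed_expiry - now
        if remaining >= window:
            continue  # the deployed cert is not close to expiring
        # A newer cert in the store does not clear the incident;
        # only deploying it would move deployed_expiry out of the window.
        renewed = (cert.stored_expiry is not None
                   and cert.stored_expiry > cert.deployed_expiry)
        note = "renewed in store but not yet deployed" if renewed else "not yet renewed"
        incidents.append(f"{cert.service}: expires in {remaining.days} days ({note})")
    return incidents


if __name__ == "__main__":
    # Sample data: the Feb. 22 expiry comes from the article; the rest is made up.
    now = datetime(2013, 2, 1)
    certs = [
        CertStatus("blob", datetime(2013, 2, 22), stored_expiry=datetime(2014, 2, 22)),
        CertStatus("example-internal", datetime(2014, 6, 1)),
    ]
    for line in open_incidents(certs, now):
        print(line)
```

Run as a script, this reports only the blob certificate, which is 21 days from expiry in the sample data even though a replacement already sits in the store.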

"We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly," he noted. "In the interim, all manual processes involving certificates have been reviewed with the teams. We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure."

Posted by Jeffrey Schwartz on 03/04/2013 at 1:15 PM



Reader Comments:

Sat, Jan 18, 2014

Adding certificate: Microsoft Certified Systems Engineer Certification (MCSE 2003), with the following exams: 70-294 (Planning, Implementing, and Maintaining a Windows Server 2003 Active Directory Infrastructure)

Tue, Mar 5, 2013 MPD East Coast

Why do these certs expire? Why not install long lived certs that expire 10 years from now?

Mon, Mar 4, 2013 ibsteve2u Commonwealth of Pennsylvania

@Mel & Omah, who said "The cert guy probably got laid off last go round. Ooops." Nah - ponder the sentence fragment "We will also automate any associated manual processes..." - the cert guy(s) will get laid off NEXT go round. (As half of the malware community is drooling over the possibility of taking advantage of that automation to propagate bad certs throughout Microsoft.)

Mon, Mar 4, 2013

Another reason to avoid the cloud. Chances are your organization, thus your IT infrastructure is a small fraction of the size of a company like Microsoft, Amazon, etc. In many ways, the more eggs in your basket, the harder it is to keep track of them.

Mon, Mar 4, 2013 Mel & Omah

The cert guy probably got laid off last go round. Oops.
