Breakdown in Procedures Caused Azure Outage -- Redmondmag.com

The Schwartz Report

When Microsoft's Windows Azure cloud storage service went down worldwide late last month, the company confirmed the cause of the massive meltdown within a few hours. An expired SSL certificate crippled the service from late Friday, Feb. 22, into the next day.

Furious customers wanted to know how something as simple as renewing an SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world? It turns out the cause was a "breakdown in procedures," according to Mike Neil, general manager for Microsoft's Windows Azure service, in a contrite post-mortem published Friday that details the cause of the outage and the plans to ensure the error isn't repeated.

"A breakdown in our procedures for maintaining and monitoring these certificates was the root cause," Neil noted. "Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system."

Neil explained that Windows Azure has an internal service called the Secret Store that manages hundreds of certificates used to securely run the cloud service. The Secret Store alerted the team on Jan. 7 that the blob, queue and table certificates would expire on Feb. 22. It turns out the storage team did update the certificates but failed to flag a storage service release as one that included the updated certs.

"Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline," Neil explained. "Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system." That is how the renewal fell through the cracks.
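The alerting gap Neil describes can be illustrated with a short Python sketch. Microsoft's post does not say how the Secret Store alerts work internally, so everything here is an assumption for illustration: the function names, the 90-day window and the 2014 renewal date are hypothetical. The point is only that a check keyed to the stored record goes quiet as soon as the record is updated, while a check keyed to the certificate actually in production keeps firing until the release ships.

```python
from datetime import date, timedelta

# Hypothetical alert window; the real threshold is not given in the post-mortem.
WINDOW = timedelta(days=90)

def store_record_alert(store_expiry, today, window=WINDOW):
    """Alert based only on the certificate record held in the secret store."""
    return store_expiry - today < window

def deployed_cert_alert(deployed_expiry, today, window=WINDOW):
    """Alert based on the certificate actually serving traffic."""
    return deployed_expiry - today < window

today = date(2013, 2, 1)
store_expiry = date(2014, 2, 22)     # assumed expiry of the renewed certificate
deployed_expiry = date(2013, 2, 22)  # old certificate still in production

# The store looks healthy, so a store-only check stays silent.
store_fires = store_record_alert(store_expiry, today)
# The deployed certificate is about to expire, so this check still fires.
deployed_fires = deployed_cert_alert(deployed_expiry, today)
print(store_fires, deployed_fires)
```

In this sketch the second check is the one that would have kept the delayed release visible to the team.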

So what's Microsoft doing to ensure this doesn't happen again? Neil said the Windows Azure team will improve the process for detecting certificates that need to be renewed. Production certs due to expire in less than three months will generate an operational incident and will be tracked as what he termed a Service Impacting Event.
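A minimal Python sketch of that kind of renewal check follows. It is not Microsoft's implementation: the `Certificate` record, the `expiring_soon` helper, the `example-internal` entry and the reading of "three months" as 90 days are all assumptions made for illustration. Only the blob, queue and table names and the Jan. 7 and Feb. 22 dates come from the post-mortem.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "less than three months" is approximated as 90 days.
RENEWAL_WINDOW = timedelta(days=90)

@dataclass
class Certificate:
    name: str         # hypothetical label for the certificate
    expires_on: date  # expiration date of the certificate

def expiring_soon(certs, today, window=RENEWAL_WINDOW):
    """Return certificates that expire inside the window, soonest first."""
    due = [c for c in certs if c.expires_on - today < window]
    return sorted(due, key=lambda c: c.expires_on)

certs = [
    Certificate("blob", date(2013, 2, 22)),
    Certificate("queue", date(2013, 2, 22)),
    Certificate("table", date(2013, 2, 22)),
    Certificate("example-internal", date(2014, 1, 1)),  # hypothetical entry
]

# Evaluate the inventory as of Jan. 7, the day the original alert was raised.
flagged = expiring_soon(certs, today=date(2013, 1, 7))
for cert in flagged:
    print(f"{cert.name} expires on {cert.expires_on.isoformat()}")
```

Each flagged certificate would then open an incident instead of producing a one-time notice.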

"We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly," he noted. "In the interim, all manual processes involving certificates have been reviewed with the teams. We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure."

Posted by Jeffrey Schwartz on 03/04/2013
