Microsoft Offers Explanation for Its Azure Outages
Microsoft has published an explanation for its widespread Azure outages that lasted up to 11 hours on Tuesday.
Azure components are now "working properly," Microsoft indicated yesterday. The one exception remains with Azure Virtual Machines in West Europe, where some virtual machines are reportedly in a "start state."
Microsoft's explanation for the outages is not too surprising. The basic story is that Azure had an unknown software flaw. The flaw was "discovered" after the rollout of a software patch to Microsoft Azure datacenters, affecting operations worldwide. However, Microsoft also admits that it didn't follow "the standard protocol of applying production changes in incremental batches" before rolling out its software patch globally.
The details are spelled out in a summary analysis by Jason Zander, corporate vice president on the Microsoft Azure team, in a blog post late Thursday. He indicated that "a limited subset of customers are [sic] still experiencing intermittent issues." Zander promised that Microsoft would provide a more detailed "root cause analysis" of the incident later on.
Here's how the incident unfolded, according to Zander's post. First, Microsoft had an undiscovered flaw in the Blob table front ends of Azure. Unfortunately, an update that was expected to improve Azure Storage Services surfaced this flaw in the Blob table front ends. The flaw caused an infinite loop in which the Blob front ends stopped taking on traffic, affecting services worldwide. Quite a lot of Microsoft Azure services depend on Azure Storage Services, so the infinite loop problem affected related Azure services, such as Virtual Machines and Websites, among others. While Microsoft attempted to address the problem, doing so entailed restarting the Blob front ends, which further delayed the recovery.
As with past Azure outages, this outage affected the Service Health Dashboard, which is the portal that customers use for understanding the state of various Azure services. Microsoft couldn't update the Service Health Dashboard for three hours after the outage. Consequently, it used Twitter and other social media to report the problems, according to Zander's post.
The outage also affected the ability of Azure customers to use the Service Health Dashboard to actually report support cases "during the early phase of the outage," according to Zander.
Zander promised that Microsoft would take the following steps:
- Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches.
- Improve the recovery methods to minimize the time to recovery.
- Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
- Improve Service Health Dashboard Infrastructure and protocols.
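To make the first item concrete, here is a minimal sketch of what applying changes in incremental batches can look like. It is an illustration only: Zander's post does not describe Microsoft's deployment tooling, and the names `rollout_in_batches`, `apply_change` and `is_healthy` are hypothetical. The idea is that a change reaches a small group of targets first, and the rollout stops as soon as a group looks unhealthy, so a hidden flaw cannot reach every datacenter at once.

```python
from typing import Callable, List, Sequence


def rollout_in_batches(
    targets: Sequence[str],
    batch_size: int,
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one batch at a time (hypothetical helper).

    Stops after the first batch that fails its health check and returns
    the targets that received the change, in the order they were updated.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated: List[str] = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]

        # Apply the change to this batch only.
        for target in batch:
            apply_change(target)
            updated.append(target)

        # Verify the batch before touching any further targets.
        if not all(is_healthy(target) for target in batch):
            break

    return updated
```

With a batch size of one and a health check that fails on the second target, only the first two targets are ever touched. The remaining targets keep running the old version until the flaw is understood.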
In reaction to Microsoft's explanation, Microsoft MVP Aidan Finn offered thoughtful comments about Microsoft's production change process. He suggested in a Petri IT Knowledgebase article that IT pros may have a different perspective on updating systems than Microsoft's Azure developers. For instance, IT pros typically run tests of new software updates before delivering them to their production systems.
Finn also wondered about Microsoft's communications to its customers. While Microsoft used Twitter when its Service Health Dashboard wasn't functioning, Finn noted that Microsoft has the e-mail address of "every subscriber owner and delegate administrator."
Kurt Mackie is senior news producer for 1105 Media's Converge360 group.