Microsoft Offers Explanation for Its Azure Outages

Microsoft has published an explanation for its widespread Azure outages that lasted up to 11 hours on Tuesday.

Azure components are now "working properly," Microsoft indicated yesterday. The one exception remains with Azure Virtual Machines in West Europe, where some virtual machines are reportedly in a "start state."

Microsoft's explanation for the outages is not too surprising. Azure had an unknown software flaw, which was "discovered" after a software patch was rolled out to Microsoft Azure datacenters, affecting operations worldwide. Microsoft also admits, however, that it didn't follow "the standard protocol of applying production changes in incremental batches" before rolling out the patch globally.

The details are spelled out in a summary analysis by Jason Zander, corporate vice president on the Microsoft Azure team, in a blog post late Thursday. He indicated that "a limited subset of customers are [sic] still experiencing intermittent issues." Zander promised that Microsoft would later provide a more detailed "root cause analysis" of the incident.

Here's how the incident unfolded, according to Zander's post. First, Azure had an undiscovered flaw in its Blob table front ends. An update that was expected to improve Azure Storage Services surfaced the flaw, which caused an infinite loop in which the Blob front ends stopped taking on traffic. Many Microsoft Azure services depend on Azure Storage Services, so the infinite loop problem spread to related services worldwide, such as Virtual Machines and Websites. Addressing the problem entailed restarting the Blob front ends, which further delayed the recovery.

As with past Azure outages, this outage affected the Service Health Dashboard, which is the portal that customers use for understanding the state of various Azure services. Microsoft couldn't update the Service Health Dashboard for three hours after the outage. Consequently, it used Twitter and other social media to report the problems, according to Zander's post.

The outage also prevented Azure customers from using the Service Health Dashboard to report support cases "during the early phase of the outage," according to Zander.

Zander promised that Microsoft would take the following steps:

  • Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches.
  • Improve the recovery methods to minimize the time to recovery.
  • Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
  • Improve Service Health Dashboard Infrastructure and protocols.
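
To illustrate what "applying production changes in incremental batches" can mean in practice, here is a minimal Python sketch. It is not Microsoft's deployment tooling, whose details Zander's post does not describe. The function names (plan_batches, staged_rollout) and the apply_change and is_healthy callbacks are hypothetical, chosen only for illustration. The idea is that a change goes to a small batch first, and the rollout halts if any updated target fails a health check, so a latent flaw is contained before it reaches every datacenter.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def plan_batches(targets: Sequence[str], batch_sizes: Iterable[int]) -> List[List[str]]:
    """Split targets into successive batches; any remainder forms a final batch."""
    batches: List[List[str]] = []
    start = 0
    for size in batch_sizes:
        if start >= len(targets):
            break
        batches.append(list(targets[start:start + size]))
        start += size
    if start < len(targets):
        batches.append(list(targets[start:]))
    return batches


def staged_rollout(
    targets: Sequence[str],
    batch_sizes: Iterable[int],
    apply_change: Callable[[str], None],   # hypothetical: deploys the change to one target
    is_healthy: Callable[[str], bool],     # hypothetical: health probe for one target
) -> Tuple[List[str], bool]:
    """Apply a change batch by batch and stop at the first unhealthy batch.

    Returns (targets that received the change, whether the rollout was halted).
    """
    updated: List[str] = []
    for batch in plan_batches(targets, batch_sizes):
        for target in batch:
            apply_change(target)
            updated.append(target)
        # Verify the whole batch before touching the next, larger one.
        if not all(is_healthy(target) for target in batch):
            return updated, True
    return updated, False
```

For example, calling staged_rollout with six hypothetical regions and batch sizes of 1 and 2 would update one region, then two more, and only then the remaining three, stopping early if any health check fails along the way.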

In reaction to Microsoft's explanation, Microsoft MVP Aidan Finn offered thoughtful comments about Microsoft's production change process. He suggested in a Petri IT Knowledgebase article that IT pros may have a different perspective on updating systems than Microsoft's Azure developers do. For instance, IT pros typically test new software updates before delivering them to their production systems.

Finn also wondered about Microsoft's communications to its customers. While Microsoft used Twitter when its Service Health Dashboard wasn't functioning, Finn noted that Microsoft has the e-mail address of "every subscriber owner and delegate administrator."

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.
