The Schwartz Cloud Report

Microsoft Promises Better Communication After Azure Leap Day Outage

It was bad enough that Microsoft's Windows Azure cloud service was unavailable for much of the day on Feb. 29 thanks to the so-called Leap Day bug. Making matters worse, customers struggled to find out what was going on and when service would be restored.

That's because the Windows Azure Dashboard itself wasn't fully available, noted Bill Laing, corporate VP of Microsoft's Server and Cloud division, in a blog post Friday, where he provided an in-depth post-mortem with extensive technical details outlining what went wrong. In very simple terms, it was the result of a coding error that led the system to calculate a future date that didn't exist.
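To make that class of error concrete, here is a minimal Python sketch of how "same date, one year later" can land on a day that doesn't exist. It is an illustration only, not Microsoft's code; the function names are hypothetical, and the fallback to Feb. 28 is just one reasonable policy.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical naive approach: bump the year and keep month/day as-is.
    # Raises ValueError when d is Feb. 29 and the following year has no Feb. 29.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: if the target date does not exist, fall back to Feb. 28.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

Whether to fall back to Feb. 28 or roll forward to March 1 is a design choice; the point is that the code has to make that choice explicitly rather than assume the date exists.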

But others may be less interested in what went wrong than in how reliable Windows Azure and public cloud services will be over the long haul. On that front, Laing was pretty candid, saying, "The three truths of cloud computing are: hardware fails, software has bugs and people make mistakes. Our job is to mitigate all of these unpredictable issues to provide a robust service for our customers."

Did Microsoft do enough to mitigate this issue? Laing admits Microsoft could have done a better job of preventing, detecting and responding to the problems. In terms of prevention, Microsoft said it will improve testing to discover time-related bugs by upgrading its code analysis tools to uncover those and similar types of coding issues. The problem also took too long to detect, 75 minutes, Laing added, citing a specific issue with detecting faults in the guest agent, where the bug was found.

Exacerbating the whole matter was the breakdown in communication. The Windows Azure Service Dashboard failed to "provide the granularity of detail and transparency our customers need and expect," Laing said. Hourly updates failed to appear and information on the dashboard lacked helpful insight, he acknowledged.

"Customers have asked that we provide more details and new information on the specific work taking place to resolve the issue," he said. "We are committed to providing more detail and transparency on steps we're taking to resolve an outage as well as details on progress and setbacks along the way."

Noting that customer service telephone lines were jammed due to the lack of information on the dashboard, Laing promised users will not be kept in the dark. "We are reevaluating our customer support staffing needs and taking steps to provide more transparent communication through a broader set of channels," he said. Those channels will include Facebook and Twitter, among other forums.

Windows Azure customers affected by the outage will receive a 33 percent credit, which will automatically be applied to their bills. Such credits, while welcome, rarely make up for the costs associated with downtime. But if Microsoft delivers on Laing's commitments, perhaps the next outage will be less painful.

Posted by Jeffrey Schwartz on 03/14/2012 at 1:14 PM



Reader Comments:

Wed, Mar 14, 2012

Just an FYI to possibly avoid an upcoming problem: there will be a "leap second" added to the last minute of the day on 30 June 2012, the first since 2009 and the 25th since the practice started in 1972. That is, 23:59:60 UTC will be a valid time on that date and the start of 1 July will be delayed by that one second. [Leap seconds are required to keep UTC approximately in sync with solar time: whenever UTC will vary from actual astronomical mean solar time by more than 1/2 second, due to changes in earth's rotation rate, UTC is adjusted in increments of one second, and (so far) always on either 31 December or 30 June.] To my knowledge, no Microsoft software properly tracks or handles leap seconds -- not Windows, not SQL Server, not .NET, and therefore probably not Azure. While they are adding what should have been obvious test cases to validate leap year code, they might also want to add a test case for a pre-announced 61-second minute (or even a 59 second minute - leap seconds can be negative, although none have been so far) happening at irregular intervals.
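As a rough illustration of the test case this comment suggests, the Python sketch below parses a UTC timestamp whose seconds field is 60. Python's own `datetime` type rejects `second=60`, so the hypothetical `parse_utc` helper clamps to :59 and reports the leap second through a separate flag. The timestamp format and the clamping policy are assumptions made for the example, and the helper does not check that the leap second was actually announced for that date.

```python
import re
from datetime import datetime, timezone

# Assumed input format for this sketch: YYYY-MM-DDTHH:MM:SSZ
_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z")


def parse_utc(text: str):
    """Return (datetime, is_leap_second) for a 'YYYY-MM-DDTHH:MM:SSZ' string."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    year, month, day, hour, minute, second = map(int, match.groups())
    is_leap_second = second == 60
    if is_leap_second:
        # datetime cannot represent :60, so clamp to :59 and flag it instead.
        second = 59
    parsed = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return parsed, is_leap_second


if __name__ == "__main__":
    print(parse_utc("2012-06-30T23:59:60Z"))
    try:
        datetime(2012, 6, 30, 23, 59, 60)
    except ValueError as err:
        print("datetime rejects second=60:", err)
```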
