The Irony Behind the Windows Azure Meltdown
Just as I was getting ready to call it a week late Friday afternoon, Microsoft's Windows Azure cloud storage service went down worldwide. As I reported, Windows Azure storage was unavailable because of an expired SSL certificate.
The global outage of Windows Azure late Friday into Saturday is ironic, considering that a study released last week found Windows Azure storage offered the fastest response times among five large cloud networks -- the other four being those operated by Amazon Web Services, Google, HP and Rackspace. Good thing for Microsoft that Nasuni, the vendor that ran the shootout, wasn't testing Windows Azure at that time.
Once the service was back up Saturday, I posted an update noting that Microsoft had fixed the problem and users could once again access their data. The company said it was 99 percent available early Saturday and completely restored by 8 p.m. PST. But the damage was already done and many customers and partners were furious.
In comments posted on a Windows Azure forum, Sepia Labs' Brian Reischl, who first pointed to the SSL certificate as the likely culprit, seemed to feel users should cut Microsoft some slack. Reischl said letting an SSL certificate fall through the cracks is a mistake anyone could make. "I know I have. It's easy to forget, right?," he posted. "It's an amateur mistake, but it happens. You end up with some egg on your face, add a calendar reminder for next year, and move on."
But one has to wonder how Microsoft, which has staked its future on the cloud and has spent billions to build Windows Azure into one of the largest global cloud services, could have lacked both safeguards against the domino effect that occurred when that cert expired and a mechanism to know when any of its certificates are about to expire. Putting the expiration dates in admins' Outlook calendars would be a good start.
Of course there are more sophisticated tools to make sure SSL certificates don't expire. Among them is the certificate monitoring and expiration management component of SolarWinds' Server & Application Monitor, a Redmond reader favorite. Another option not so coincidentally hit my inbox this morning. Matt Watson, founder of Stackify, spent a few hours over the weekend developing a free tool called CertAlert.me, which allows a site owner to scan the Web sites it owns and track SSL and domain name expirations.
"It happens a lot," Watson told me in a brief telephone conversation regarding outages such as the one that struck Friday, which affected Stackify. "All you can do is sit on your hands and pray," he said, adding that years ago he had to deal with an expired SSL certificate himself. "You buy them and you forget about them and the next thing you know your site's gone. It's one of those things that get overlooked."
Asked what business opportunity he sees in offering this free service, Watson said it was a chance to bring exposure to the startup's namesake offering, a Windows Azure-based server monitoring platform targeted at easing access for developers while ensuring they don't have access to production systems.
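For readers curious what a bare-bones expiration check involves, here is a minimal sketch in Python using only the standard library. It is not how CertAlert.me or the SolarWinds product works; the function names, the 30-day warning window and the sample dates are all my own assumptions for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format Python's ssl module reports,
    e.g. "Mar 15 12:00:00 2013 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=443, timeout=10.0):
    """Read the notAfter field from a live server's certificate.

    Caveat: the default context verifies the certificate, so a cert that
    has already expired raises an SSL error here instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Made-up dates, not those of the certificate involved in the outage.
    sample = "Mar 15 12:00:00 2013 GMT"
    today = datetime(2013, 3, 1, tzinfo=timezone.utc)
    print(days_until_expiry(sample, now=today), needs_renewal(sample, now=today))
```

Run on a schedule against each host name you care about (passing the result of `fetch_not_after` to `needs_renewal`), something along these lines would at least turn a forgotten certificate into a warning rather than an outage.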
Indeed, you can bet Microsoft is going to try to ensure it doesn't happen again. "Our teams are also working hard on a full root cause analysis (RCA), including steps to help prevent any future reoccurrence," said Steven Martin, Microsoft's general manager of Windows Azure business and operations, in a blog post apologizing for the disruption. Given the scope of the outage, Microsoft will offer credits in conformance with its SLAs, Martin said.
This is not the first outage Microsoft has had to explain and probably won't be the last. And we all know the number of well-publicized outages Amazon Web Services has encountered in recent years.
If you're a Windows Azure customer, did last week's slipup erode your confidence in storing your data in Microsoft's cloud? Drop me a line at [email protected].
Note: This post was updated to clarify that the Windows Azure outage affected Stackify.
Posted by Jeffrey Schwartz on 02/25/2013 at 1:15 PM