Microsoft Offers Explanations for Lync and Exchange Service Outages -- Redmondmag.com

By Kurt Mackie
06/27/2014

Microsoft has provided a somewhat more detailed public account of the Office 365 service outages that occurred earlier this week.

Rajesh Jha, corporate vice president for Office 365 engineering, described two separate incidents: one that affected Lync Online users on Monday (June 23) and another that affected Exchange Online users on Tuesday (June 24). The outages affected only Microsoft's North American datacenters, and the problems that caused them have since been fixed, he explained in a Microsoft forum post.

In the Lync Online incident, some users in North America couldn't log in to the service. Microsoft fixed that specific log-in problem "in minutes," Jha said, but "the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration." That extended duration appears to have been a good part of the working day on June 23, according to a chronicle kept by veteran Microsoft reporter Mary Jo Foley.

The Exchange Online outage also seems to have been a small problem that just escalated after being detected. Jha explained that a directory partition stopped responding to authentication requests. That problem caused "a small set of customers to lose email access." However, the problem somehow affected Microsoft's broader e-mail traffic flow. Many Exchange Online users reported not being able to send or receive e-mail. Jha said that the initial Exchange Online failure led to an "unexpected issue":

Unfortunately, the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers. Our recovery strategy was two pronged: 1) We partitioned the mail delivery system away from the failed directory partition and 2) directly addressed the root cause for the failed directory partition. In addition to fixing the root cause trigger, we are working on further layers of hardening for this pattern.

The Exchange Online problem persisted through most of the day on June 24. Jha also noted that the Service Health Dashboard, which provides Office 365 service uptime reports to subscribers, had a problem with its "publishing process, meaning not all impacted customers were notified in a timely way." He said that the problem with the Service Health Dashboard has "since been addressed."

Microsoft plans to provide more details about the outages to its customers via a "post-incident report," which will appear in the Service Health Dashboard, Jha said. Microsoft doesn't have a publicly accessible portal showing its Office 365 service health, so much of the news about the outages on Monday and Tuesday was initially relayed through Twitter posts.

Microsoft offers a "three nines" or 99.9 percent uptime service level agreement as part of its Office 365 business plans. If Microsoft fails to meet 99.9 percent uptime in a given month, then the subscriber may be eligible to get a service credit. However, the subscriber has to file a claim with Microsoft to get the credit. The service credit is calculated as a percentage of the monthly service fees and is returned to the customer, with the percentage depending on how far uptime fell below the threshold. Microsoft shows those uptime percentages and corresponding service credits in the following table:

Monthly Uptime	Service Credit
< 99.9%	25%
< 99%	50%
< 95%	100%

Table 1. Service credit percentages based on monthly Office 365 uptime. Source: Microsoft's "Service Level Agreement for Microsoft Online Services" document.
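The tiers in Table 1 can be read as a simple lookup. The following is a minimal sketch in Python, assuming the thresholds are applied from the lowest tier up. The function name and the example fee are hypothetical and are not taken from Microsoft's SLA document, which defines how uptime is actually measured.

```python
def service_credit_percent(monthly_uptime_percent: float) -> int:
    """Return the service credit tier from Table 1 for a given monthly uptime.

    Hypothetical helper for illustration only; how uptime is measured is
    governed by Microsoft's SLA document, not by this function.
    """
    if monthly_uptime_percent < 95.0:
        return 100
    if monthly_uptime_percent < 99.0:
        return 50
    if monthly_uptime_percent < 99.9:
        return 25
    return 0  # SLA met, so no credit applies


def service_credit_amount(monthly_fee: float, monthly_uptime_percent: float) -> float:
    """Credit as a share of the monthly fee (the fee value is an assumed input)."""
    return monthly_fee * service_credit_percent(monthly_uptime_percent) / 100


if __name__ == "__main__":
    # Assumed example: a 1,000.00 monthly bill and a few sample uptime figures.
    for uptime in (99.95, 99.5, 98.0, 90.0):
        print(uptime, service_credit_percent(uptime), service_credit_amount(1000.0, uptime))
```

The credit is not applied automatically: as noted above, the subscriber still has to file with Microsoft to receive it.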

A 99.9 percent uptime allows for about 43 minutes of downtime in a 30-day month, or roughly 8.8 hours of downtime per year. Microsoft's outages on Monday and Tuesday lasted perhaps six hours and nine hours, respectively, according to press reports.
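The downtime figures follow from simple arithmetic. The sketch below, in Python, shows the calculation under stated assumptions: a 30-day month, a 365-day year, and a customer affected for the full reported duration of an outage. The function names are hypothetical, and Microsoft's SLA document may measure uptime differently.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, period_days: int) -> float:
    """Downtime budget, in minutes, implied by an uptime target over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_days * MINUTES_PER_DAY


def uptime_percent_after_outage(outage_hours: float, period_days: int) -> float:
    """Uptime for a period if a customer was down for the whole outage (assumption)."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - outage_hours * 60.0 / total_minutes)


if __name__ == "__main__":
    # 99.9 percent over an assumed 30-day month and 365-day year.
    print(round(allowed_downtime_minutes(99.9, 30), 1), "minutes per month")
    print(round(allowed_downtime_minutes(99.9, 365) / 60, 2), "hours per year")
    # Press-reported outage lengths; per-customer impact is an assumption.
    for hours in (6, 9):
        print(hours, "hour outage:", round(uptime_percent_after_outage(hours, 30), 2), "percent uptime")
```

Under these assumptions, a customer affected for the whole of either outage would fall below the 99.9 percent threshold for June. The actual figure for any given subscriber depends on how long that subscriber was affected.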

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.
