Joey on SQL Server

Office 365 Email Outage Highlights Limits of Cloud Resilience Planning

A recent Exchange Online disruption tied to Microsoft network changes underscores how deeply businesses still rely on email -- and how few practical options organizations have to protect critical workflows when cloud services fail.

The goal of operations teams is to ensure the stability of their services and applications. Anyone versed in modern site reliability engineering (SRE) techniques will tell you that this goal runs counter to the goals of product development teams, which are to ship fixes and features. Hyperscalers have modified their deployment patterns to limit the risk and the "blast radius" (a dramatic way of saying how much stuff goes down if this update blows up) of any change by using ring-based deployments. A ring-based deployment typically starts with internal services, moves to smaller regions, then to larger ones and finally to the global deployment.
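As a rough illustration of the ring idea, here is a minimal Python sketch. The ring names and the `deploy` and `is_healthy` callables are hypothetical placeholders of my own, not anything Microsoft has published; the point is only that a failed health check in an early ring stops the change from reaching the larger ones.

```python
from typing import Callable, List

# Hypothetical ring order, smallest blast radius first (names are illustrative).
RINGS = ["internal", "small-region", "large-region", "global"]


def roll_out(rings: List[str],
             deploy: Callable[[str], None],
             is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy ring by ring and stop at the first unhealthy ring.

    Returns the rings that were deployed and passed their health check.
    """
    completed = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            # Halt here so later, larger rings never receive the bad change.
            break
        completed.append(ring)
    return completed


if __name__ == "__main__":
    touched = []
    # Simulate a change that only misbehaves once it reaches "large-region".
    healthy = roll_out(RINGS, touched.append, lambda r: r != "large-region")
    print("healthy rings:", healthy)
    print("rings touched:", touched)
```

In this simulation the change reaches three rings but never goes global, which is the whole appeal of the pattern.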

Of course, not all kinds of updates support ring-based deployments, and some services have larger blast radii than others. Work on storage and edge network infrastructure tends to carry some of the highest risks, because so many other services depend on them.

Last week, Microsoft was trying to improve its network routing infrastructure and, in doing so, broke an important service called the Global Locator Service (GLS), which locates the correct Exchange tenant and provides service infrastructure mapping. Microsoft underestimated the load on the load balancers fronting that service, and they became unhealthy. This ultimately led to a cascading failure that took down most of North America's Office 365 email services for approximately three to five hours.

If you had asked me even three to four years ago what the most important system in any given company was, I would have said email. While persistent chat systems like Teams and Slack have reduced the importance of email in many organizations, email is still at the heart of many business workflows. While preparing this column, I learned from a colleague that a dentist's office couldn't receive X-rays during last week's outage. From a system architect's perspective, while email is easy to make highly available (Exchange itself runs on a database availability group, which allows it to span multiple nodes), it's hard to protect against disasters.

Like everything else in information technology, the biggest challenges to moving systems across geographies are data volume and everyone's favorite point of failure -- DNS. While it would be trivial to point your mail DNS record (called a mail exchanger, or MX, record) at another email provider, in the age of complex anti-spam systems, it's not quite that simple. Additionally, even if you could quickly move email to a new service that used your same authentication, so you didn't have to send new authentication info to your users, they wouldn't have access to any of their existing emails or calendar information, rendering their email experience useless.

What are organizations to do to protect against email outages? My biggest recommendation is to have a fallback plan for critical processes that are driven by email. Most modern monitoring solutions support Teams/Slack, SMS, push notifications and even voice messages as fallbacks to email. Similar thought needs to be given to application data workflows that are solely dependent on email. In terms of user communication, your users do have a fallback -- Teams stayed online throughout this Exchange Online outage, so users could still communicate with their co-workers and, depending on your federation configuration, maybe even partners in other organizations. If you are not currently federated with key business partners, you should consider that Teams configuration.
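To show what a fallback for an email-driven alert might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: `send_email` and `send_chat` are hypothetical stand-ins, not calls into any real monitoring product, and the email stand-in always fails to simulate an outage.

```python
from typing import Callable, List, Optional, Tuple

# A channel is a (name, send function) pair; a send function raises on failure.
Channel = Tuple[str, Callable[[str], None]]


def notify_with_fallback(message: str, channels: List[Channel]) -> Optional[str]:
    """Try each channel in order and return the name of the first that succeeds.

    Returns None if every channel fails.
    """
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            # This channel is unavailable; fall through to the next one.
            continue
    return None


def send_email(message: str) -> None:
    # Hypothetical stand-in for a real mail call; simulates the provider being down.
    raise ConnectionError("email service unavailable")


def send_chat(message: str) -> None:
    # Hypothetical stand-in for posting to a Teams or Slack channel.
    print(f"[chat] {message}")


if __name__ == "__main__":
    used = notify_with_fallback(
        "Disk space low on db01",
        [("email", send_email), ("chat", send_chat)],
    )
    print("delivered via:", used)
```

The ordering of the channel list is the design decision that matters: put the preferred channel first and make sure at least one entry does not share a dependency with it.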

Third-party vendor Mimecast offers a business continuity solution for Office 365 that enables users to access their mailboxes and attachments during outages. Mimecast also provides a send and delivery queue through its web portal, allowing users to send and receive email in near real-time. This requires changing email routing so it runs first through Mimecast's MX record, then points delivery and outbound routing to Office 365. Obviously, this comes at an additional cost beyond your Office 365 subscription.

Another option for enterprises, or multinational corporations, is to isolate their Office 365 users into separate tenants based on geography. You may already need to do this for data sovereignty reasons; however, it also reduces your exposure to a large-scale outage. Last week's email outage affected only North American users, based on my experience and discussions with European colleagues during the outage. Identifying the region of your Office 365 tenant is not as simple as it is with Azure resources -- consult this document for more details. You will need to have the "multi-geo" add-on to implement this design within a single organization.

None of these solutions is perfect, complete, or easy. Also, they all cost extra money beyond your existing Office 365 licensing. While cloud vendors like Microsoft and AWS offer financially backed service-level agreements (SLAs), the refunds customers receive are often tiny compared to potential business losses. Ultimately, organizations have become wholly dependent on cloud providers and are at their mercy. Still, managing your own email at any scale is expensive, time-consuming and a security risk, so the cost-benefit case for moving to a provider like Office 365 or Gmail always makes sense.

In the early days of cloud, architects and developers were trained to assume failure and to develop applications that could tolerate transient failures. In fact, the site reliability engineering practice I mentioned earlier relies on the concept of "downtime budgets" (often called error budgets) and holds that services should explicitly account for them so that other services can handle partial failures. That theory is all well and good for organizations that develop software and have SRE teams, but it probably doesn't help the dentist's office that needs to get X-rays via email.
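The arithmetic behind a downtime budget is simple enough to sketch in a few lines of Python. The 99.9% target and the 30-day window below are assumptions I picked for illustration, as are the function names; check your own agreement for the real figures.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a window at a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def budgets_consumed(outage_hours: float, availability_pct: float,
                     days: int = 30) -> float:
    """How many windows' worth of downtime budget a single outage uses up."""
    return outage_hours * 60 / downtime_budget_minutes(availability_pct, days)


if __name__ == "__main__":
    # Assumed target of 99.9% over an assumed 30-day window.
    budget = downtime_budget_minutes(99.9)
    print(f"monthly budget: {budget:.1f} minutes")
    # A four-hour outage, in the middle of the three-to-five-hour range.
    print(f"budgets consumed: {budgets_consumed(4, 99.9):.1f}")
```

Under those assumptions the monthly budget is roughly 43 minutes, so a four-hour outage burns through more than five months' worth of budget in one afternoon.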

I always like to offer solutions when I write a column -- as an architect, that's generally my job. This is one of those scenarios where there aren't many good answers. The types of firms that are going to feel the most pain from a cloud email outage are exactly the types of organizations that shouldn't manage their own email servers. The bigger concern is the pattern of increased outages in Azure and Office 365 as Microsoft increases its AI focus. I have to wonder whether shifting so many financial and development resources toward the various Copilot efforts has impacted reliability engineering. While there's nothing you can really do about a provider-driven email outage, you should use this time to identify the business processes that are solely dependent on email and evaluate how another communications medium could make those workflows more robust.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.
