Microsoft Fabric's Global Outage Leaves Customers Waiting for Answers -- Redmondmag.com

A week after Microsoft Fabric's worldwide network-access issue, customers still have little information about what caused the outage or how Microsoft plans to prevent a repeat.

By Joey D'Antoni
05/26/2026

Building large-scale software-as-a-service applications is very hard. You must balance security, performance, feature requests, and infrastructure to build a robust product while maintaining viable profit margins. Building a successful product has become even more challenging amid workforce reductions, as many large software vendors have made major capital investments in AI. With all that in mind, SaaS vendors still need to deliver robust, highly available, secure products to their customers. Those customers build business processes around SaaS services, whether it's customer relationship management systems like Salesforce or Dynamics, or data analytics solutions like Microsoft Fabric or Databricks.

Outages are always going to happen -- no matter how much we engineer and architect, there is nearly always a single point of failure that gets overlooked during design and testing. Having friends and colleagues who work at these vendors, I'm always sympathetic to the challenges and pressure they face, and I can understand how bugs and mistakes can happen, even if they look glaringly obvious to an outsider. That extra node the architect wanted to provide an extra degree of reliability? Maybe it got killed by the business planning team to glean a few extra percentage points of margin on the service. Those extra testing cycles the engineering team wanted before taking the feature live? They got eliminated because the product manager and senior executives wanted to launch the feature at the next keynote.

What I can't tolerate is when a cloud or SaaS provider has an outage and isn't completely transparent about what happened and how they plan to fix it. For example, last fall, Azure Front Door (a networking and content delivery network service) suffered two nearly back-to-back global failures. Within hours, we had a preliminary incident report (PIR), and within a week, the team reported back with a plan to prevent any subsequent outages and another to make the service and its recovery more robust. Likewise, a widespread Azure outage in the East US 2 region a couple of weeks ago had a PIR the day the event happened, and a large-scale public report on both the outage and the planned remediations.

Last Monday (May 18th), Microsoft Fabric had a global outage. The service itself mostly remained up, but the network front end was completely inaccessible worldwide. In the initial incident report on the status page, the Fabric team reported an unexpected increase in traffic, leading to "intermittent" network outages. The outage lasted approximately six to eight hours. Frustratingly, in both public and private channels, others and I have reached out to Microsoft to request a PIR (root cause) and have received no response.

Editor's note: Shortly before we published this article, Microsoft published a PIR for the Fabric outage. Please note this link requires a high level of Office 365 privileges to access. The PIR reiterated that the root cause was a DDoS attack. A Word doc attached to the PIR proposed some mitigations against similar attacks, but with no timelines for implementation. Note: An Iranian threat actor group took credit for last week's DDoS attack.

I don't want to dig too deeply into the technical details -- the lack of response is by far the worst part of this event. But this is my column, so let's dig into some possibilities. One thing that's unusual about Fabric is that, from a DNS perspective, it uses a single URL as the entry point for all users: app.fabric.microsoft.com. Having a single subdomain as a point of entry is not inherently a single point of failure -- DNS redirection services like Azure Traffic Manager allow that single URL to point to multiple endpoints globally. Microsoft has never been fully transparent about Fabric's physical architecture, which is their right, but they do mention the use of Front Door and protection against DDoS attacks in this design document. Cloud providers also use a technique called anycast routing, in which the same IP address is advertised from multiple locations around the world.

Running nslookup on app.fabric.microsoft.com shows that it is a CNAME that maps to app.powerbi.com and ultimately to an Azure Traffic Manager endpoint, which, based on my testing, routes the user to the nearest Fabric front-end endpoint. With that basic architecture in mind, I'm less clear as to how a global outage like this could happen.
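The lookup described above can be approximated in a few lines of Python. This is a minimal sketch rather than the exact procedure used for this column: it relies on the standard library's socket.gethostbyname_ex, and the helper name summarize_resolution and its injectable resolver parameter are illustrative assumptions. Results will vary with your location and the DNS resolver you use.

```python
import socket


def summarize_resolution(hostname, resolver=socket.gethostbyname_ex):
    """Resolve a hostname and report what the resolver returned.

    The resolver parameter is a hypothetical hook so the function can be
    exercised without network access; by default it uses the system resolver.
    """
    canonical, aliases, addresses = resolver(hostname)
    return {
        "queried": hostname,
        # Primary host name reported by the resolver (typically the end of
        # any CNAME chain).
        "canonical": canonical,
        # Alternative names the resolver reported for the same host.
        "aliases": list(aliases),
        # IPv4 addresses returned for the host.
        "addresses": list(addresses),
        # True when the name we asked for is not the primary name, which
        # suggests the queried name is an alias (CNAME).
        "is_alias": canonical.rstrip(".").lower() != hostname.rstrip(".").lower(),
    }
```

Calling summarize_resolution("app.fabric.microsoft.com") from different networks and comparing the returned addresses is one rough way to observe the nearest-endpoint routing in action.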

Let's talk about what happens during a DDoS attack. Threat actors will build a botnet, a large group of previously compromised devices. These can be cloud endpoints, IoT devices, PCs, or even network hardware like routers -- basically anything where the bad guys can execute scripts. The attackers then send a series of small packets to the public endpoints of services such as the Domain Name System (DNS), the Network Time Protocol (NTP), and ping, among others. While each attacker request is very small, the service returns more data than it received, and at scale that traffic overwhelms the target. As a quick example, I used the ping command to hit the Fabric endpoint:

I'm sending 56 bytes initially, and each response from my request is 64 bytes. For the 56 bytes I sent, the service has returned 320. Multiply that sort of request by tens of thousands, and you can see how this attack could take an unprotected service down. In this case, we're connecting to a Microsoft-owned IP address in Virginia (please excuse the terrible latency, I retook this screenshot from a plane).
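The arithmetic behind that multiplication can be sketched as follows. The 56 and 320 byte figures come from the ping example above; the botnet size and per-device request rate are hypothetical values chosen only to illustrate scale, and the function names are my own.

```python
def amplification_factor(bytes_sent, bytes_returned):
    """Ratio of bytes returned by a service to bytes sent by the requester."""
    if bytes_sent <= 0:
        raise ValueError("bytes_sent must be positive")
    return bytes_returned / bytes_sent


def aggregate_response_bits_per_second(bots, requests_per_second, bytes_returned):
    """Total response traffic if every bot sends requests at the same rate."""
    return bots * requests_per_second * bytes_returned * 8


# Figures from the ping example: 56 bytes sent, 320 bytes returned.
factor = amplification_factor(56, 320)

# Hypothetical botnet: 50,000 devices, each sending 100 requests per second.
traffic_bps = aggregate_response_bits_per_second(50_000, 100, 320)

print(f"Amplification factor: {factor:.2f}x")
print(f"Aggregate response traffic: {traffic_bps / 1e9:.1f} Gbit/s")
```

Under those assumed numbers, the responses alone add up to roughly 12.8 Gbit/s, which is enough to saturate many unprotected services.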

These attacks are why almost every commercial website employs some form of DDoS protection. The providers (including AWS, Microsoft, and Cloudflare) have something most websites do not -- an abundance of available bandwidth (hundreds of terabits per second). When the services detect an attack pattern, they spread traffic across various black-holed (non-reachable) endpoints before quickly blocking the IP addresses of the attacking botnet, allowing the protected services to operate normally. There is a bit of back-and-forth here between attackers and defenders, as attackers try to use larger botnets and less detectable attack techniques, while defenders use machine learning to more readily identify attack patterns.
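To make the detect-then-block idea concrete, here is a deliberately simplified sketch: a sliding-window counter that blocks a source IP once it exceeds a request threshold. It is a stand-in for illustration only and does not describe how Microsoft, AWS, or Cloudflare implement their protection, which relies on far more signals than a per-IP rate. The class name and thresholds are hypothetical.

```python
from collections import defaultdict, deque


class RateBasedBlocker:
    """Block a source IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # source IP -> timestamps inside the window
        self.blocked = set()

    def allow(self, source_ip, now):
        """Return True if the request should be served, False if it is dropped."""
        if source_ip in self.blocked:
            return False
        timestamps = self.recent[source_ip]
        # Discard timestamps that have aged out of the sliding window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        timestamps.append(now)
        if len(timestamps) > self.max_requests:
            # Too many requests in the window: block this source from now on.
            self.blocked.add(source_ip)
            del self.recent[source_ip]
            return False
        return True


# Example: allow at most 3 requests per second from any single address.
blocker = RateBasedBlocker(max_requests=3, window_seconds=1.0)
decisions = [blocker.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
```

In this toy version a blocked address stays blocked forever. A real system would expire blocks and would also have to cope with spoofed source addresses, which is part of the back-and-forth described above.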

What happened with Fabric last Monday? We simply don't know. It's been a week, and Microsoft has said virtually nothing beyond initial messaging during the outage.

This lack of transparency is not new for the Fabric team and continues to be a frustration for my customers. Organizations already have limited options for making Fabric more resilient through higher availability and disaster recovery. Limited regional service degradations in Fabric happen regularly and are not widely broadcast. These mini outages lead IT admins to question whether the problem lies in their code or in Fabric itself.

One of the pieces of advice I give customers is to use multiple capacities across Azure regions to mitigate some of that risk, but in last week's outage, the network endpoint was unavailable globally, so a regional spread wouldn't have helped. I feel like a broken record when I say I'd like to see the Fabric team be more like Azure. Once again: a couple of weeks ago, we had a regional outage in Azure and had a PIR within a day. Here we are, more than a week after a global Fabric outage, and we have virtually no information.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.
