Joey on SQL Server
Microsoft Fabric's Global Outage Leaves Customers Waiting for Answers
A week after Microsoft Fabric's worldwide network-access issue, customers still have little information about what caused the outage or how Microsoft plans to prevent a repeat.
- By Joey D'Antoni
- 05/26/2026
Building large-scale software-as-a-service applications is very hard. You must balance security, performance, feature requests, and infrastructure to build a robust product while maintaining viable profit margins. Building a successful product has become even more challenging amid workforce reductions, as many large software vendors have made major capital investments in AI. With all that in mind, SaaS vendors still need to deliver robust, highly available, secure products to their customers. Those customers build business processes around SaaS services, whether it's customer relationship management systems like Salesforce or Dynamics, or data analytics solutions like Microsoft Fabric or Databricks.
Outages are always going to happen -- no matter how much we engineer and architect, there is nearly always a single point of failure that gets overlooked during design and testing. Having friends and colleagues who work at these vendors, I'm always sympathetic to the challenges and pressure they face, and I can understand how bugs and mistakes can happen, even if they look glaringly obvious to an outsider. That extra node the architect wanted to provide an extra degree of reliability? Maybe it got killed by the business planning team to glean a few extra percentage points of margin on the service. Those extra testing cycles the engineering team wanted before taking the feature live? They got eliminated because the product manager and senior executives wanted to launch a feature at the next keynote.
What I can't tolerate is when the cloud or SaaS provider has an outage and isn't completely transparent about what happened and how they plan to fix it. For example, last fall, Azure Front Door (a networking and content delivery network service) suffered two nearly back-to-back global failures. Within hours, we had a preliminary incident report, and within a week, the team reported back with a plan to prevent any subsequent outages and another to make the service and recovery more robust. Likewise, a widespread Azure outage in the East US 2 region a couple of weeks ago had a preliminary incident report (PIR) the day the event happened, followed by a large-scale public report on both the outage and the planned remediations.
Last Monday (May 18th), Microsoft Fabric had a global outage. The service itself mostly remained up, but the network front end was completely inaccessible worldwide. In the initial incident report on the status page, the Fabric team reported an unexpected increase in traffic, leading to "intermittent" network outages. The outage lasted approximately six to eight hours. Frustratingly, in both public and private channels, others and I have reached out to Microsoft to request a PIR (root cause) and have received no response.
Editor note: Shortly before we published this article, Microsoft published a PIR for the Fabric outage. Please note this link requires a high level of Office 365 privileges to access. The PIR reiterated that the root cause was a DDOS attack. A Word doc attached to the PIR proposed some mitigations against similar attacks, but with no timelines for implementation. Note: An Iranian threat actor group took credit for last week's DDOS attack.
I don't want to dig too deeply into the technical details -- the lack of response is by far the worst part of this event. But this is my column, so let's dig into some possibilities. One thing that's fairly unique about Fabric is that, from a DNS perspective, it uses a single URL as the entry point for all users: app.fabric.microsoft.com. Having a single subdomain as a point of entry is not inherently a single point of failure -- DNS redirection services like Azure Traffic Manager allow that single URL to point to multiple endpoints globally. Microsoft has never been fully transparent about Fabric's physical architecture, which is their right, but they do mention the use of Front Door and protection against DDOS attacks in this design document. Cloud providers also use a technique called anycast routing, in which the same IP address is advertised from locations around the world, so each user's traffic lands at the nearest one.
An nslookup on app.fabric.microsoft.com shows that it is a CNAME that maps to app.powerbi.com and ultimately to an Azure Traffic Manager endpoint, which, based on my testing, routes the user to the nearest Fabric front-end endpoint. With that basic architecture in mind, I'm less clear as to how a global outage like this could happen.
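For readers who want to trace that kind of alias chain themselves, here is a minimal Python sketch. It is an illustration, not a record of Fabric's actual DNS configuration: the first mapping in EXAMPLE_RECORDS is the one described above, while the Traffic Manager hostname is a made-up placeholder and the helper names are hypothetical. The live_lookup function uses the standard library resolver, needs network access, and will return different answers depending on where and when it runs.

```python
import socket


def follow_cname_chain(name, cname_records, max_hops=10):
    """Return the list of names visited while following CNAME records."""
    chain = [name]
    while chain[-1] in cname_records:
        if len(chain) > max_hops:
            raise ValueError("CNAME chain too long (possible loop)")
        chain.append(cname_records[chain[-1]])
    return chain


def live_lookup(name):
    """Ask the local resolver for the canonical name, aliases, and IPv4 addresses.

    Needs network access; results vary by location and over time.
    """
    canonical, aliases, addresses = socket.gethostbyname_ex(name)
    return canonical, aliases, addresses


# Illustrative records only. The first mapping is the one described in the
# article; the Traffic Manager hostname is a made-up placeholder.
EXAMPLE_RECORDS = {
    "app.fabric.microsoft.com": "app.powerbi.com",
    "app.powerbi.com": "placeholder.trafficmanager.net",
}

if __name__ == "__main__":
    chain = follow_cname_chain("app.fabric.microsoft.com", EXAMPLE_RECORDS)
    print(" -> ".join(chain))
```

Running live_lookup from machines in different regions is one rough way to see whether the final addresses change with location, which is the behavior a nearest-endpoint routing policy would produce.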
Let's talk about what happens during a DDOS attack. Threat actors will build a botnet, a large group of previously compromised devices. These can be cloud endpoints, IoT devices, PCs, or even network hardware like routers -- basically anything where the bad guys can execute scripts. The attackers then try to send a series of small packets to the public endpoints of services such as the Domain Name System (DNS), the Network Time Protocol (NTP), and ping, among others. While the attackers' request is very small, the service returns more data than it received, and at scale that load eventually overwhelms the service. As a quick example, I'm using the ping command to hit the Fabric endpoint:
Figure 1.
I'm sending 56 bytes of data per request (64 bytes once the 8-byte ICMP header is added), and each response is also 64 bytes, so the 320 bytes the service returned match what I sent across all of my requests. Ping is roughly symmetric, but protocols like DNS and NTP can return responses many times larger than the request. Multiply that sort of request by tens of thousands of senders, and you can see how this attack could take an unprotected service down. In this case, we're connecting to a Microsoft-owned IP address in Virginia (please excuse the terrible latency; I retook this screenshot from a plane).
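To make the byte counts concrete, here is a small arithmetic sketch in Python. Ping reports its 56-byte payload, and the fixed 8-byte ICMP header brings each echo message to 64 bytes, so a ping exchange has an amplification factor of about 1. The bot count, request rate, and response size in the last example are hypothetical numbers chosen for illustration, not measurements from this outage.

```python
ICMP_HEADER_BYTES = 8  # fixed size of the ICMP echo header


def icmp_bytes(payload_bytes):
    """Size of an ICMP echo message (header plus payload), excluding the IP header."""
    return ICMP_HEADER_BYTES + payload_bytes


def amplification_factor(request_bytes, response_bytes):
    """Response bytes produced per request byte."""
    return response_bytes / request_bytes


def attack_bits_per_second(senders, requests_per_second, response_bytes):
    """Aggregate response traffic generated by many senders, in bits per second."""
    return senders * requests_per_second * response_bytes * 8


if __name__ == "__main__":
    request = icmp_bytes(56)  # ping's default 56-byte payload
    print(request)  # 64
    print(amplification_factor(request, 64))  # 1.0

    # Hypothetical: 50,000 bots, 100 requests per second each, 3,000-byte responses.
    gbps = attack_bits_per_second(50_000, 100, 3_000) / 1e9
    print(gbps, "Gbps")  # 120.0 Gbps
```

Even with modest per-device rates, the aggregate climbs quickly, which is why the number of senders matters as much as the size of any single response.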
These attacks are why almost every commercial website employs some form of DDOS protection. The providers (including AWS, Microsoft, and Cloudflare) have something most websites do not -- an abundance of available bandwidth (hundreds of terabits per second). When the services detect an attack pattern, they spread traffic across various black-holed (non-reachable) endpoints before quickly blocking the IP addresses of the attacking botnet, allowing the protected services to operate normally. There is a bit of back-and-forth here between attackers and defenders, as attackers try to use larger botnets and less detectable attack techniques, while defenders use machine learning to more readily identify attack patterns.
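The detect-then-block idea can be sketched with a toy per-IP rate limiter in Python. This is a deliberately simple stand-in, assuming a fixed request threshold over a sliding time window; commercial DDOS protection relies on far richer signals and on network-level scrubbing, and nothing here reflects how Microsoft or any other provider implements it. The class name, thresholds, and addresses (taken from the reserved documentation ranges) are all hypothetical.

```python
from collections import defaultdict, deque


class RateBlocker:
    """Block a source IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now):
        """Record a request at time `now` (seconds); return False if it is dropped."""
        if ip in self.blocked:
            return False
        stamps = self.recent[ip]
        stamps.append(now)
        # Discard timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            self.blocked.add(ip)
            del self.recent[ip]
            return False
        return True


if __name__ == "__main__":
    blocker = RateBlocker(max_requests=3, window_seconds=1.0)
    # A noisy source sends four requests within 0.3 seconds.
    for t in (0.0, 0.1, 0.2, 0.3):
        print("203.0.113.5", t, blocker.allow("203.0.113.5", t))
    # A well-behaved source sends one request every two seconds.
    for t in (0.0, 2.0, 4.0):
        print("198.51.100.7", t, blocker.allow("198.51.100.7", t))
```

A fixed threshold like this is easy for a large botnet to evade by keeping each device under the limit, which is part of why the back-and-forth between attackers and defenders continues.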
What happened with Fabric last Monday? We simply don't know. It's been a week, and Microsoft has said virtually nothing beyond initial messaging during the outage.
This lack of transparency is not new with the Fabric team and continues to be a frustration for my customers. Organizations already have limited options for making Fabric more resilient through higher availability and disaster recovery. Limited regional service degradations in Fabric happen regularly and are not widely broadcast. These mini outages lead IT admins to question whether the problem lies in their code or in Fabric itself.
One of the pieces of advice I give customers is to use multiple capacities across Azure regions to mitigate some of that risk, but for last week's outage, the network endpoint was unavailable globally, so having a regional spread wouldn't have helped. I feel like a broken record when I say I'd like to see the Fabric team be more like Azure. Once again, a couple of weeks ago, we had a regional outage in Azure and had a PIR within a day. Here we are more than a week after a global Fabric outage, and we have virtually no information.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.