Joey on SQL Server
Microsoft Fabric's Availability Gaps Erode Trust Amid Repeated Outages
Despite its SaaS promise, Microsoft Fabric has faced multiple unaddressed outages in recent months, prompting calls for better transparency, postmortem reporting and official communication from IT professionals left in the dark.
- By Joey D'Antoni
- 06/20/2025
There's a saying in American football circles: "The best ability is availability." The notion here is that a player, no matter how talented, can't make an impact unless they are actually on the field. Computer systems have the same concern -- no matter how impactful a system is, if it suffers from outages and downtime, it can't help the business that purchased it. Designing highly available systems has always been one of my interests in computing, and the discipline has given rise to a dedicated field known as site reliability engineering (SRE).
One of the common questions I discuss with customers about managed services is, "What happens if the service goes down?" That is when I typically discuss the high availability architecture of the service and review the detailed outage history that the provider supplies. In these cases, I explain to my customers how the outages can occur, how they can protect themselves from them and what the costs will be. For this process to work, service vendors need to be upfront about their architecture, build highly available solutions and provide support mechanisms for disaster recovery in the event of failures outside of their control. Finally, it requires the cloud provider to be fully transparent about outages, postmortems and service architectures.
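To make that conversation concrete, here is a minimal sketch of the arithmetic I have in mind when reviewing an outage history: turning a list of outage durations into a measured availability figure, and turning an availability target into a downtime budget. The outage durations, the 30-day window and the 99.9% target are hypothetical numbers chosen for illustration; they are not Fabric's actual outage history or any published SLA.

```python
from datetime import timedelta


def measured_availability(outages, period):
    """Return the fraction of `period` the service was up.

    `outages` is an iterable of timedelta durations (assumed non-overlapping).
    """
    downtime = sum(outages, timedelta())
    return 1 - downtime / period


def downtime_budget(target, period):
    """Return the maximum downtime allowed in `period` for a target availability."""
    return period * (1 - target)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    period = timedelta(days=30)
    outages = [timedelta(hours=2), timedelta(hours=1), timedelta(minutes=30)]
    target = 0.999

    print(f"Measured availability: {measured_availability(outages, period):.4%}")
    print(f"Downtime budget at {target:.1%}: {downtime_budget(target, period)}")
```

The point of the comparison is that even a handful of multi-hour incidents can consume far more than a month's downtime budget at a three-nines target, which is why a published outage history matters when evaluating a managed service.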
Fabric's SaaS design means that customers have limited control over their availability model. The service model also puts the onus on Microsoft to provide high levels of availability and transparency. You can see how Microsoft details that for both Office 365 and Azure on this page. Note that you will have to log in and acknowledge an agreement to view the attestation documents. Fabric is a glaring absence from this page. Although it is a relatively new service, it has been generally available (GA) for over a year and a half, with many organizations relying on its services. I feel like the product's lack of maturity has been used as an excuse for some of the outages and other features that haven't worked as expected.
In the last couple of months, Microsoft Fabric has had a series of three substantial outages. Customers still don't fully understand the root cause of these outages, except that they were related to new code releases and spread across multiple regions. Microsoft has not documented an official outage history or postmortems on official Web sites. There is a real-time outage dashboard, but it has been inconsistent in its operations; it eventually identified this week's outage but has exhibited erratic behavior. This pattern is not uncommon -- many cloud providers have struggled to provide status updates because the status site typically shares infrastructure with the downed services. With Fabric, many users have turned to third-party sites like DownDetector or the Fabric subreddit to identify outages.
Fabric is packaged and marketed as a "software as a service" (SaaS) offering. What this means for both Fabric and most SaaS offerings is that customers have less granular control over the service. For example, Microsoft 365 does not offer an option to configure a secondary region for disaster recovery. Fabric supports DR across regions by deploying capacities across geographies. However, that support is incomplete, there is no fully automated model for customers to fail over easily and quickly, and several limitations exist. Like many systems, configuring DR for Fabric at least doubles the cost to the customer. Disaster recovery storage for Fabric costs nearly twice as much per gigabyte as regular storage. Fabric does have inherent support for availability zones, but that support is not universal across Fabric services or regions. While these limitations are common across many types of disaster recovery patterns, they are more concerning in a system where administrators have less control, such as Fabric.
Having limited failover options means that availability is fully in the hands of the service provider. When a SaaS product like Fabric goes down, the result is lots of unhappy business users screaming at IT professionals, who literally can only open a support ticket with Microsoft or complain on the subreddit. This places a higher burden of responsibility on the service provider both to ensure that systems are available and to provide frequent, clear communications during any outages or service degradation.
The subreddit is another point of contention that I have with Fabric. While I appreciate having a direct channel to communicate with users, the challenge is that the subreddit has become the de facto official site for the product team to communicate with users. Senior leadership from Microsoft posted information about last month's outage to the subreddit. Still, it lacked detail and did not mention any corrective actions taken in response to the blackout. I have also seen undocumented information shared on the subreddit by product team members. Additionally, many users are not on Reddit, and some enterprises may even block the site on their firewalls. During outages, even if the status page is not functioning, there must be live communication from Microsoft through an official Microsoft channel (a status blog is also acceptable).
Having updated status is only one part of this equation -- the other part is explaining what happened in the outage. A recent example of this is the Google outage that occurred last week. As you can see, if you click through, the postmortem is detailed, tied to the downtime, and outlines both the problems and the corrective actions planned. I'm mentioning Google here because of its recency; however, for Azure outages, you can see the same sort of information. I would like to see Data Platform leadership commit to the same level of detail and information sharing for any outages or degradations on the Fabric platform.
The brief information Microsoft has shared indicates that new code deployments caused these Fabric outages. Fabric does have a rollout deployment strategy that aligns with Microsoft's best practices. However, the pattern in which the recent outages spread (admittedly based on online reports) does not align with the type of staggered rollout described in the linked document. Microsoft could efficiently address this concern by sharing detailed postmortems publicly.
I don't mean to slight anyone's hard work in writing this -- we all know how much time and effort goes into building robust software solutions. Fabric is a complex product that attempts to cover more ground than many products have in the past, so there will be challenges. However, with repeated outages, combined with many of the existing challenges Fabric has faced since its inception, many customers have begun to lose trust in the service. Fabric can help overcome these concerns by both being more transparent and delivering higher-quality services.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.