Joey on SQL Server
Microsoft Fabric Deep Dive: A Rare Look Under the Hood
Microsoft is normally tight-lipped about Fabric's architecture, but it briefly bucked tradition earlier this month.
- By Joey D'Antoni
- 10/23/2024
After speaking at the European Fabric Community Conference last month, I followed it up with the Dataminds Connect event in Mechelen, Belgium, in early October. As a longtime conference speaker, I usually don't attend many conference sessions, but both the Dataminds keynote and one later session caught my eye. In that later session, Conor Cunningham, partner architect at Microsoft, spoke about the internals of Microsoft Fabric.
I've written much about Microsoft Fabric here and elsewhere, but you may have noticed those columns have been light on technical details. Microsoft hasn't shared much about the back-end architecture of Fabric because of its "software as a service" nature. However, understanding how Fabric works under the hood can help administrators, architects and developers make the right decisions when it comes to optimizing performance.
A Little Bit of History
Before I dive into Cunningham's talk, it's essential to understand how we got here.
The history of data warehousing started with what's known as symmetric multiprocessing (SMP), which is a fancy way of describing a data warehouse running on a single server with multiple processors. As data volumes and workloads grew, the industry developed massively parallel processing (MPP) systems like Teradata, Parallel Data Warehouse and Exadata. MPP systems use many servers to distribute large query workloads, returning results to a central control server or node.
While MPP systems provided excellent performance, they were costly, required significant capital investments and stored data locally -- meaning, once you invested in one, it would be your data warehouse platform for a long time.
Two major technology trends have brought us to our current state: broader use of the open source data warehousing platform Hadoop, in conjunction with cloud computing and object-based cloud storage like Amazon S3 and Azure Blob Storage. Hadoop was created at Yahoo when the cost of using Oracle technologies to analyze massive volumes of weblogs became prohibitive. Hadoop's most significant contribution was a distributed file system called HDFS, which became the basis for cloud object stores. However, Hadoop had the same data locality problem as MPP: it's always hard to move multiple terabytes of data from one cluster of computers to another. Additionally, most non-developers found it less than user-friendly.
The introduction of cloud storage, combined with distributed computing, fixed the data movement problem: simply land your data in a "data lake" and then point your compute at it.
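To make that idea concrete, here is a minimal sketch of the pattern in Synapse serverless-style T-SQL, which illustrates the same separation of storage and compute that Fabric builds on; the storage account, container and folder path are hypothetical placeholders.

```sql
-- A minimal sketch of "point your compute at the data lake":
-- query Parquet files directly where they already live in object storage.
-- The storage account, container and path below are hypothetical placeholders.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS orders;
```

No data is loaded into a warehouse first; the engine reads the files in place and only the compute is provisioned on demand.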
The Cloud Era to Today
A couple of other things have changed in the interim years. Spark has mostly supplanted Hadoop as the compute engine of choice (due to improved performance), while SQL has remained the most popular language for data analytics. Spark, like Hadoop, is not simple to run natively, which created a great opportunity for Databricks, Azure Synapse and Microsoft Fabric to simplify the use of Spark and make life easier for both data engineers and analysts.
Microsoft made a significant investment in consolidating all of its analytics offerings under Fabric. While other offerings still exist, the focus of Microsoft's engineering efforts within the data platform group has been the Fabric platform. There were some early missteps and functionality gaps, but weekly builds have been a good showcase of the improvements and features Microsoft has added to Fabric. Cunningham's talk highlighted some of those deep engineering efforts, specifically around Fabric's data warehouse and SQL analytics endpoint offerings.
Understanding Fabric
Microsoft does an excellent job of building robust technical components and reusing them across its systems. Fabric offers a couple of good examples: the T-SQL language, and columnstore indexes, previously seen in SQL Server, Parallel Data Warehouse and, to an extent, Analysis Services and Power BI. These components make up much of the Fabric engine and appear throughout the platform.
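For anyone who hasn't worked with columnstore outside of a warehouse appliance, here's a minimal T-SQL sketch of that building block as it appears in plain SQL Server; the table and index names are hypothetical. Fabric's warehouse applies the same columnar ideas, but over Parquet/Delta files rather than an index you create yourself.

```sql
-- Columnstore as a reusable building block: in SQL Server you create it
-- explicitly, while Fabric's warehouse stores data in a columnar format
-- by default. Table and index names here are hypothetical.
CREATE TABLE dbo.FactSales
(
    SaleKey     bigint        NOT NULL,
    OrderDate   date          NOT NULL,
    ProductKey  int           NOT NULL,
    Quantity    int           NOT NULL,
    SalesAmount decimal(18,2) NOT NULL
);

-- Compress and store the table column-by-column for analytic scans.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;
```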
Cunningham's talk focused on the optimizations and architecture in the data warehouse and SQL endpoint components. What is the difference between the Lakehouse SQL Endpoint and a Fabric Data Warehouse? It's like comparing Synapse Serverless to Dedicated Capacity: both have a similar compute engine, but some differences exist -- for instance, a warehouse lets you update data and gives you a complete T-SQL surface area. The back-end compute is very similar between the two systems. You can read about other differences in more detail in this doc.
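As a rough illustration of that surface-area difference, the T-SQL below (against a hypothetical dbo.FactSales table) shows the kind of statement that runs in a Fabric warehouse but not against a lakehouse's SQL analytics endpoint, which is read-only from T-SQL.

```sql
-- In a Fabric Data Warehouse, full T-SQL DML is available:
UPDATE dbo.FactSales
SET    SalesAmount = SalesAmount * 1.05
WHERE  OrderDate >= '2024-01-01';

-- Against a Lakehouse SQL analytics endpoint the same UPDATE is rejected,
-- because the endpoint is read-only from T-SQL; writes to lakehouse tables
-- happen through Spark or other engines instead. Reads work in both:
SELECT OrderDate, SUM(SalesAmount) AS TotalSales
FROM   dbo.FactSales
GROUP  BY OrderDate;
```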
As in Synapse Analytics, Azure SQL Database serves as the back end for the Fabric Data Warehouse (and SQL endpoint). These databases do not hold customer data; they merely store references to the Parquet and Delta files, along with column statistics, while the data itself is stored and accessed via a OneLake storage abstraction API. The underlying data files map into compute cells. This mapping occurs at query execution, where each cell gets placed onto a compute node. This architecture partitions the underlying data to provide more overall throughput, and the number of back-end nodes dynamically scales with query complexity to deliver the query results back to the front-end node.
Fabric uses a unique caching layer. The scale-out model means the cache can grow effectively without limit, within reason. As cached data ages, it is recycled and ultimately torn down. The cache is distributed over a large number of nodes so that the distributed optimizer and scheduler can spread queries across the nodes holding cached data.
What precisely is a node? Cunningham was non-specific here, but he did mention that a node is not inherently a VM; instead, it is a unit smaller than a VM. This decision was made for a couple of reasons, like improving recoverability from task failures and restarts. Additionally, according to Cunningham, as new hardware generations arrive in Azure, services need to adjust workloads to deal with that, and mapping nodes at a sub-VM level makes it easier to deliver consistent performance across new and older hardware. I think this architecture sounds a lot like containers, and Service Fabric (Azure's core framework and a completely different product from Microsoft Fabric because naming things is hard) supports container-based deployments.
Cunningham also promised a performance guidance document for Fabric in the near future and highlighted a few points that can help optimize performance in Fabric environments. While these won't surprise longtime database pros, he stressed the importance of choosing correct datatypes, maintaining good statistics so the optimizer can do a better job and, finally, paying attention to the size and data distribution of your Parquet files.
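To ground the first two points, here is a brief T-SQL sketch; the table, column and statistics names are hypothetical, and Fabric also creates and updates many statistics automatically. The third point -- Parquet file size and distribution -- is governed by however you land the data (Spark, pipelines and so on) rather than by T-SQL.

```sql
-- 1. Choose the narrowest datatype that fits the data: a date is a date,
--    not a varchar(8000), and an order key doesn't need bigint if int fits.
CREATE TABLE dbo.FactOrders
(
    OrderKey    int           NOT NULL,
    OrderDate   date          NOT NULL,  -- not varchar
    CustomerId  int           NOT NULL,
    OrderAmount decimal(18,2) NOT NULL   -- not float or varchar
);

-- 2. Give the optimizer good statistics on columns used in joins and filters.
--    Fabric maintains many statistics automatically; these are the manual
--    equivalents for a column you know matters.
CREATE STATISTICS stats_FactOrders_OrderDate
    ON dbo.FactOrders (OrderDate) WITH FULLSCAN;
UPDATE STATISTICS dbo.FactOrders;
```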
As a writer and a technologist, I found it cool to get a peek behind the curtain at how Microsoft Fabric works. Software vendors like to make complex solutions appear simple in marketing material to avoid scaring potential customers. However, the reality is that no matter how simple a service appears -- and Fabric is not simple -- the IT staff supporting it needs to understand how it functions internally to make the service work as well as it can within their organization. I appreciate Cunningham and Microsoft sharing this kind of information about Fabric, and I hope the information flow continues.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.