Joey on SQL Server

The State of Microsoft Data Part 2

While Microsoft's latest offering, Fabric, is a viable option for many, it's not the only player out there.

Last time, you learned about the current state of the art around operational databases on the Microsoft Data Platform. I separated out analytic databases, like data warehouses, because in a cloud world, these are two very different types of platforms. Operational databases are optimized in terms of hardware and architecture for better write throughput, while analytic databases are typically optimized for large batch data ingest and analytic querying.

I hate to use a buzzword like "big data," but the 2008-to-2013 era did change the arc of analytical systems. Before the broad adoption of Hadoop, data warehouses in most cases lived in a relational database on a single server. Even when those systems ran on massively parallel processing (MPP) clusters, the data was stored inside the database, which made using other tools, or migrating away, problematic. Hadoop changed this paradigm by making large-scale parallelism more attainable and by introducing an object-storage model that is now the foundation of most data lakes, data lakehouses and many data warehouses.

With that history in mind, let's examine the current state of available solutions. The significant change in this space last year was that Microsoft introduced, and subsequently GAed, Fabric, its all-in-one data analytics solution. I have had the opportunity to write and deliver training on Fabric and to explore the solution. The vision and direction for the product are particularly good. However, the current product is rough around the edges and remains a work in progress. Fortunately, because Fabric is a SaaS product, changes come frequently.

Microsoft Fabric
Let's talk about the capabilities of Fabric -- it covers the entire analytics surface area and is mainly made up of existing Azure components:

  • Data Engineering: Using Spark, Fabric supports authoring experiences and the entire Spark surface area (a minimal sketch follows this list).
  • Data Factory: Uses a good chunk of the existing Azure Data Factory functionality to provide ETL using Power Query and other tools.
  • Data Science: Build data science models using Azure Machine Learning for model management and training, and embed those predictive models into Power BI dashboards.
  • Data Warehouse: Similar to Azure Synapse Analytics (but different, more on that later), Fabric offers a SQL-based data warehouse experience.
  • Real-Time Analytics: This is somewhat new functionality in the Microsoft BI stack, but it uses existing technology -- Kusto Query Language from Azure Log Analytics -- to enable observational data analytics.
  • Power BI: This functionality integrates Microsoft's popular BI data visualization platform with all Fabric's data storage and management options.

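To make the Data Engineering bullet a bit more concrete, here is a minimal sketch of the kind of PySpark cell you might run in a Fabric lakehouse notebook. The file path and table name are hypothetical, and in a notebook the spark session already exists; the getOrCreate() call is only there to make the sketch self-contained.

    from pyspark.sql import SparkSession

    # Notebook environments already provide `spark`; getOrCreate() is a no-op there.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical raw file landed in the lakehouse Files area.
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("Files/raw/sales.csv")
    )

    # Light cleanup before persisting the data for downstream reporting.
    cleaned = df.dropDuplicates().na.drop(subset=["order_id"])

    # Write the result as a Delta table (hypothetical name) so SQL and Power BI can query it.
    cleaned.write.format("delta").mode("overwrite").saveAsTable("sales_cleaned")
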
Those six bullet points could each fill a day or more of training. Fabric has a lot of surface area and will be used in many permutations of those components. The other thing to consider is that, given Microsoft's investments in Fabric, there will be a lot of sales pressure to move customers to it. One aspect I haven't covered is that you will likely see more integration with Microsoft Purview for governance and security features. However, if you are currently happy with your analytics stack, the current state of Fabric is not likely to blow you away. If Microsoft successfully implements mirroring in Fabric, it could change the game, making business analytics much more accessible to a wide array of organizations.

With Fabric getting all the press, many customers have asked about the future of Azure Synapse Analytics. Synapse Analytics started life as the Azure SQL Data Warehouse service (a cloud implementation of the Parallel Data Warehouse functionality), which is still part of the service as dedicated SQL pools. When the Synapse branding arrived several years ago, Microsoft added both serverless SQL and Spark functionality, which are now also part of Fabric.

If you are an existing Synapse customer and you are happy with your experience, I would be in no rush to move to Fabric. Likewise, if I were building a new, very large (5+ TB) data warehouse, Synapse would be at the top of my list on Azure. The reason is that the architecture of the data warehouse component in Fabric is very different, and I think that, at least for right now, Synapse will be easier to work with and will likely perform better. However, if you want to leverage the Spark or serverless components of Synapse, I would dive deeper into Fabric, as those feature sets are already more mature in Fabric, and that is where all the development investment is heading.

What About Databricks?
Azure Databricks, which, despite its name, is not a native Microsoft-owned Azure service, has also become a top-rated data warehousing solution. Databricks is built on Apache Spark, and development is done using notebooks that support various languages, including SQL, Scala, Java and Python. With its Unity Catalog offering, Databricks has also made headway into the data governance space. Databricks integrates tightly with Azure -- you use Azure Active Directory and Azure Data Lake Storage (Gen 2) as part of your solution.
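As a hedged illustration of that integration, the sketch below shows what reading Delta data from ADLS Gen2 might look like in an Azure Databricks Python notebook. The storage account, container and path are hypothetical, and authentication (a service principal, managed identity or Unity Catalog external location) is assumed to already be configured.

    from pyspark.sql import SparkSession

    # In a Databricks notebook `spark` already exists; getOrCreate() is a no-op there.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical ADLS Gen2 location -- replace the account, container and path with your own.
    # Credential setup is assumed to be handled outside this snippet.
    path = "abfss://analytics@contosodatalake.dfs.core.windows.net/curated/orders"

    orders = spark.read.format("delta").load(path)

    # A quick aggregate; the same workspace lets you do this in SQL, Python or Scala.
    orders.groupBy("region").count().show()
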

My observation on Databricks is that Microsoft thought it (and Snowflake) was successful enough that the company aimed to capture more of that market by introducing Fabric. Databricks made Spark approachable enough that many organizations sought to adopt it, given its power to perform data engineering and machine learning activities. If you are already using Databricks, I see no reason to switch off it in the future. If you need to integrate with Fabric, you should still be able to connect to data sources in Databricks using shortcuts. If you want to get started with a new Spark project, I would weigh my options carefully before deciding in which direction to head -- the decision point would likely be around how much of the rest of Fabric you would be using, as opposed to just looking at Spark.

One of the benefits of separating data and compute for analytics workloads is that it becomes easier to move to new solutions as they become available. I feel this is one reason the industry has seen more development in the analytics space than on the operational side. That said, switching between analytical stacks is not trivial, and those projects have high visibility and executive exposure. Choosing the best solution for your organization requires careful evaluation and planning, along with an understanding of your team's requirements and skills, so it can take advantage of the chosen solution's full feature set.

As always, stay tuned here at Redmond as we evaluate the Microsoft data platform going forward.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.
