Joey on SQL Server
A Deep Dive Inside Azure SQL Database, Part 1
Microsoft promises 99.995 percent uptime for the Azure SQL platform, which is incredible. In the first of a two-part series, Joey takes a close look at the infrastructure that supports high availability in Azure database services.
- By Joey D'Antoni
Welcome to the first of a two-part series. In this first part, I'll talk about the physical infrastructure that Microsoft uses to ensure availability and provide a high degree of uptime for the Azure SQL platform. In Part 2, I'll talk with Microsoft engineers about how they manage and deploy Azure SQL Database and Azure SQL Managed Instance from a deployment and operations perspective.
Azure SQL Database and Azure SQL Managed Instance are what are known as Platform as a Service (PaaS) offerings, which means the vendor provides nearly all of the infrastructure services of the platform. Microsoft provides a 99.995 percent uptime service level agreement (SLA) for Azure database services, which is incredible. Let's take a look at the infrastructure that supports it.
It is important to understand the sheer level of redundancy that is built into the Azure platform. When your bottom line depends on providing computing services to nearly all of the companies in the world, you can make major investments in hardware redundancy to reach those levels of uptime. In my experience, most smaller organizations (and even some enterprises) don't invest enough in their on-premises availability, which is one of the many reasons they can benefit from migrating to the public cloud.
Both Azure SQL services are built on a framework called Service Fabric, which provides high availability and underpins reliable services such as load balancing, microservices and network infrastructure. This means the high-availability orchestration is provided by the fabric, not by Windows Server failover clustering as you might implement in your own environment. This gives the SQL Server team more granular control over failover operations and better integration with other Azure services.
There are two availability models for the Azure SQL platform: Standard and Premium. The Standard model is used in Basic, Standard and General Purpose service levels, and the Premium model is used in the Premium and Business Critical service levels.
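The tier-to-model mapping described above can be summarized as a simple lookup (the tier and model names come straight from the article; the dictionary itself is just an illustration, not an Azure API):

```python
# Which availability model backs each Azure SQL service tier,
# per the mapping described in the article.
AVAILABILITY_MODEL = {
    "Basic": "Standard",
    "Standard": "Standard",
    "General Purpose": "Standard",
    "Premium": "Premium",
    "Business Critical": "Premium",
}

model = AVAILABILITY_MODEL["General Purpose"]  # -> "Standard"
```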
In the Standard model, there is a single virtual machine (VM) that uses Azure Premium Storage for its high availability, offering functionality similar to a traditional failover cluster instance of SQL Server. User databases are stored on remote storage, and TempDB is stored locally on the VM running the SQL Server process. Azure Premium Storage uses an object-based storage model and provides data integrity by storing three synchronous copies of files (a write is not acknowledged as complete until it reaches the third copy). This allows Microsoft to provide a good SLA while keeping its cost of goods sold manageable.
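The write-acknowledgment rule above can be sketched in a few lines: a write is confirmed to the caller only after every one of the three copies has it. This is a minimal illustration of the synchronous-copy idea, not actual Azure Storage code; the `Replica` and `write_block` names are invented for the sketch.

```python
class Replica:
    """One of the three synchronous storage copies."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data
        return True  # acknowledgment from this copy


def write_block(replicas, block_id, data):
    """Acknowledge the write only after every copy has confirmed it."""
    acks = [r.write(block_id, data) for r in replicas]
    return all(acks)  # True only once the third copy acknowledges


replicas = [Replica("copy-1"), Replica("copy-2"), Replica("copy-3")]
committed = write_block(replicas, 42, b"page data")
```

The key property is that a failure of any single copy before acknowledgment prevents the write from being reported as complete, which is what lets durability survive the loss of a copy.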
The Premium model uses a three- to four-node cluster of database VMs with local SSD storage. This offers a few benefits: lower latency compared to the remote model, readable secondary replica functionality and a lower recovery time from hardware or software failures (the transaction log between databases on nodes is synced, as opposed to crash recovery, which occurs in the Standard model).
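The recovery-time difference can be illustrated with a toy model: because secondaries harden the transaction log at commit time, a failover simply promotes a replica whose log is already current, with no crash recovery pass. This is a hedged sketch of the concept, not Azure internals; class names, the replica count, and the promotion rule are all assumptions made for illustration.

```python
class LogReplica:
    """A node holding a synchronously hardened copy of the log."""
    def __init__(self, name):
        self.name = name
        self.log = []  # hardened log records

    def harden(self, record):
        self.log.append(record)


class PremiumDatabase:
    def __init__(self, replica_count=4):
        self.replicas = [LogReplica(f"node-{i}") for i in range(replica_count)]
        self.primary = self.replicas[0]

    def commit(self, record):
        # Synchronous commit: every replica hardens the log record
        # before the transaction is acknowledged.
        for r in self.replicas:
            r.harden(record)
        return True

    def failover(self):
        # Promote a secondary; its log already matches the primary,
        # so no crash recovery from remote storage is needed.
        survivors = [r for r in self.replicas if r is not self.primary]
        self.primary = survivors[0]
        return self.primary


db = PremiumDatabase()
db.commit("txn-1")
new_primary = db.failover()  # already has "txn-1" in its log
```

In the Standard model, by contrast, a failed node's replacement must replay the log from remote storage (crash recovery) before the database is available again.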
The Premium model is more robust and offers faster recovery, but it comes at a cost: From Microsoft's perspective, the compute costs are three to four times those of the Standard model. When you compare pricing from General Purpose to Business Critical, remember that you are paying for three or four VMs rather than one, yet the monthly cost to the customer is only about twice that of the Standard availability model for a given vCore count.
While the availability configuration on Azure SQL Database and Azure SQL Managed Instance is not that different from what you might have in an on-premises environment, the patching process (applying updates to SQL Server, the underlying operating system or Service Fabric) is far more advanced -- and, in my opinion, magic.
In any other SQL Server environment, patching requires restarting the service so that the new executable files installed by the patch can be loaded. With Azure SQL Database, Microsoft uses a technique called "hot patching" that modifies the C++ code within the sqlservr.exe process without pausing database activity. This process effectively replaces sqlservr.exe with sqlservr_patched.exe on disk.
After the code is modified, the new images are loaded into memory -- not on top of the existing code, but into empty space. Calls to the original functions are then redirected to the new ones: the compiler leaves six bytes of padding between functions at compile time, giving Microsoft room to write a jump instruction that points to the new version of a function. If there are problems with the new code, the redirect can simply be reversed, rolling execution back to the original function.
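The redirect-and-rollback mechanism can be mimicked at a high level: calls go through an indirection point that can be switched to the patched function and switched back. The real mechanism rewrites padding bytes into a machine-code jump inside sqlservr.exe; this Python sketch (with invented `PatchPoint` and checksum functions) only models the call redirection, not the byte-level patching.

```python
class PatchPoint:
    """Indirection standing in for the jump written into the padding."""
    def __init__(self, original):
        self.original = original
        self.target = original  # where calls currently land

    def __call__(self, *args, **kwargs):
        return self.target(*args, **kwargs)  # the "jump"

    def apply_patch(self, patched):
        self.target = patched  # redirect calls to the new code

    def rollback(self):
        self.target = self.original  # restore the original function


def checksum_v1(data):
    return sum(data) % 255  # original (buggy) version


def checksum_v2(data):
    return sum(data) % 256  # patched version


checksum = PatchPoint(checksum_v1)
before = checksum([250, 10])        # call site runs the original code
checksum.apply_patch(checksum_v2)
after = checksum([250, 10])         # same call site now runs the patch
checksum.rollback()                 # problems? revert the redirect
restored = checksum([250, 10])
```

The call site never changes and never stops running, which is the property that lets patching happen without a service restart.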
The benefits of hot patching are two-fold: It eliminates the downtime required for patching and allows Microsoft to patch more frequently than it otherwise might. This allows Microsoft to deliver fixes and service enhancements at a higher velocity to the Azure SQL Database service.
Even with the levels of automation and redundancy built into the Azure platform, outages still happen, whether from hardware or software failures. Failures happen everywhere, but when you are operating at the scale of the public cloud, you need to handle them in an automated fashion.
Microsoft has built these guidelines into Azure Service Health, which is capable of automated detection, engagement and mitigation of cloud outages. Microsoft aims to notify customers within 15 minutes from the start of the impact.
Building a highly available, robust database platform is challenging. Managing the state of databases makes high-availability solutions more challenging than stateless workloads like Web servers. Microsoft has built in numerous failsafes and improved its own software to build database infrastructure that can withstand failure and be upgraded with no downtime.
Stay tuned for Part 2 to learn more about how Microsoft manages the Azure SQL ecosystem.
Joseph D'Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.