Joey on SQL Server
A Deep Dive Inside Azure SQL Database, Part 2
In the second part of this series, Joey describes the methods that Microsoft uses to manage code rollouts, and how it detects hardware and software problems on Azure SQL.
- By Joey D'Antoni
In Part 1 of this series, you learned how Microsoft ensures that the Azure SQL Database platform is highly available and how it can patch things in an online manner. In this installment, I share some takeaways from my recent conversation with Peter Carlin, a Distinguished Engineer on Microsoft's Azure SQL team, in which we discussed the methodology and techniques that Microsoft uses to manage code rollouts, and how it detects and identifies both hardware and software problems on the platform.
If you have ever worked in a large organization, you know the challenges that managing a diverse fleet entails -- differing firmware, multiple generations of hardware, and so on. Typically, a large enterprise will have tens of thousands of servers. A public cloud provider like Microsoft will have tens of millions.
Safe Deployment Process
Last time, I talked about how the physical implementation of patches happens. But how do those patches roll out?
Microsoft uses what is known as a safe deployment process (SDP) to roll out changes across Azure. In this methodology, software is first tested in an integration environment, then in special canary regions enrolled in the Early Updates Access Program (currently East US 2 EUAP and Central US EUAP). These canary regions are representative of the larger regions that follow.
Both the scope and speed of the rollout are controllable and can be used to further mitigate deployment risks, depending on the nature of the deployment. If the rollout to those regions is successful, the changes are then rolled out at broad scale to the primary region in each region pair worldwide. Finally, the changes are rolled out to the second region in the region pair. Within Azure SQL, the deployments take place in "rings," which are clusters of 60 to 100 virtual machines (VMs) hosting Azure SQL.
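The staged, health-gated progression described above can be sketched in a few lines. This is an illustrative model only -- the `Ring`, `deploy` and `is_healthy` names are hypothetical stand-ins, not Microsoft's actual tooling:

```python
"""Minimal sketch of a ring-based safe deployment gate: deploy to each
ring in order (canary, primary region, paired region) and halt the
rollout at the first ring that fails its health check."""
from dataclasses import dataclass


@dataclass
class Ring:
    name: str
    vm_count: int  # Azure SQL rings are clusters of roughly 60-100 VMs


def rollout(rings, deploy, is_healthy):
    """Return (rings completed, ring that halted the rollout or None)."""
    completed = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            return completed, ring  # stop: later rings never receive the build
        completed.append(ring)
    return completed, None
```

The key property is that a failure in a small canary ring prevents the build from ever reaching the broader rings, which is what limits the blast radius of a bad change.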
While SDP is an Azure-wide process, Azure SQL brings a couple of unique challenges -- the first being that SQL Server is a very heterogeneous workload. For every basic-tier database that hosts a couple of tables and is accessed a couple of times a day, there are a number of incredibly busy multiterabyte databases that have hundreds or thousands of concurrent connections. What this means for deploying code is that a bug may show itself in only some portion of a workload or may take a very long time to show up in other workloads.
Some problems may be quickly identified; Carlin mentioned sqlservr.exe process crashes as a good example of this. However, some problems are slower to observe and identify; memory leaks or changes in execution-plan shape are examples of these degradations.
To understand what is going on across all of those systems, Microsoft relies on a robust telemetry pipeline that, of course, strips out all customer data to meet General Data Protection Regulation (GDPR) guidelines, but still allows Microsoft to perform both automated and manual interventions. In the hours following the deployment of new code, the telemetry process takes on extra importance. If health signals (for example, process crashes) spike after a code deployment, the build can automatically be rolled back. While parts of this process happen automatically, there are thresholds that kick in and require human intervention and inspection.
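The two-tier gating just described -- automatic rollback for large regressions, human review for smaller ones -- can be modeled as a simple threshold check. The threshold values and signal names here are invented for illustration; the article does not describe Microsoft's actual thresholds:

```python
"""Hedged sketch of post-deployment health gating: a large spike in a
health signal (e.g. crash rate vs. pre-deployment baseline) triggers an
automatic rollback, while a smaller regression escalates to a human."""

AUTO_ROLLBACK_RATIO = 2.0  # hypothetical: signal doubles vs. baseline
HUMAN_REVIEW_RATIO = 1.3   # hypothetical: smaller spikes get inspected


def gate(baseline_rate, observed_rate):
    """Classify a post-deployment health signal into an action."""
    ratio = observed_rate / baseline_rate
    if ratio >= AUTO_ROLLBACK_RATIO:
        return "rollback"   # build is rolled back automatically
    if ratio >= HUMAN_REVIEW_RATIO:
        return "escalate"   # requires human intervention and inspection
    return "continue"       # rollout proceeds to the next ring
```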
While hardware problems are typically resolved with automation within the Azure infrastructure, some hardware problems are harder to fully identify. An example that Carlin mentioned was a firmware problem on local SSDs (Azure SQL Premium and Business Critical use local SSD for data storage) that, in rare circumstances, caused a read/write ordering race condition that led to data corruption. However, because these tiers of Azure SQL use Availability Group technology for high availability, and an Availability Group will attempt to automatically repair corrupt pages from its replicas, those repairs masked the underlying failures. The page-repair events themselves were easy to spot through telemetry, but the root-cause analysis was challenging because the issue occurred so rarely.
You would think that all parts of the Azure platform have the same hardware -- and, for the most part, that's accurate. Given the scale of the hardware deployed, however, there are many differences in versions of firmware, and even different generations of hardware that are still supported within Azure. This makes identifying and correlating bugs to specific hardware platforms part of the challenge of managing a large-scale database service in a public cloud.
Carlin mentioned an observation that many Microsoft Most Valuable Professionals (MVPs) I know have made: Microsoft had 20 years of experience building database software but, before the Azure platform, had very little experience running SQL Server in production. Microsoft's experience running SQL Server at hyperscale, combined with the rich telemetry pipeline it has built, allows the Azure SQL team to identify previously undetected -- or at least very rare -- issues within the SQL Server engine. An example of this was a lost-write detection mechanism targeting the most common cause of database corruption, automatically alerting the engineering team to any mismatch in the data.
The telemetry also enables observation of common performance anti-patterns, which in turn enables focused development investment in those areas. You can see this new functionality across the SQL Server landscape in features like Accelerated Database Recovery and some of the Intelligent Query Processing features that have been built based on learning from telemetry. It also shows in the automatic tuning feature of Azure SQL Database, which can create and test new indexes and change query plans in your database without administrator intervention. This process is tested and validated in a background instance, so the platform will not implement changes until they have been validated successfully against your workload.
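The "validate before apply" pattern behind automatic tuning can be sketched as a simple loop: each candidate change (such as a new index) is kept only if background validation shows it improves the workload. The cost functions and candidate names below are hypothetical stand-ins for the platform's internal measurements:

```python
"""Sketch of validate-then-apply tuning: a candidate change is applied
only when its measured workload cost with the change beats the cost
without it. Candidates that regress the workload are discarded."""


def auto_tune(candidates, cost_without, cost_with, apply_change):
    """Return the subset of candidate changes that passed validation."""
    applied = []
    for change in candidates:
        # Background validation: compare workload cost with and without
        # the change before touching the user-facing database.
        if cost_with(change) < cost_without(change):
            apply_change(change)
            applied.append(change)
    return applied
```

This mirrors the guarantee described above: a change that does not demonstrably help your workload is never implemented.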
Building large-scale cloud services takes a lot of people and a lot of capital investment. When you compare the cost of a service against the cost of hosting a solution on-premises or running it in an Infrastructure as a Service (IaaS) VM, you should evaluate what value the service delivers beyond what you would get by managing your own VM.
In the case of Azure SQL, you get high availability, backups, automated patching and automated tuning. Beyond just automated patching, you are always on the latest version of the SQL Server engine, with the added benefit that any regressions or problems with a patch are managed for you by Microsoft. These features should weigh heavily when you are choosing a database service. By combining telemetry with ownership of the source code, Microsoft can deliver rapid enhancements to the Azure SQL product line -- enhancements that also flow to on-premises SQL Server.
Joseph D'Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.