Joey on SQL Server
A Deep Dive into SQL Server 2022's Query Performance Features, Part 1
An update to the cardinality estimator feature in SQL Server 2022 will be one of the biggest changes to the database engine since its inception.
- By Joey D'Antoni
- 08/29/2022
Database engines like SQL Server are incredibly complex feats of software engineering, built by teams of brilliant people worldwide. If you want to appreciate some of that complexity, take a look at "Readings in Database Systems" to get an idea of what goes into building an RDBMS. Essentially, engineers have to make a series of educated guesses to optimize performance across a multitude of different use cases.
While the database engine has some information in the form of column and index statistics to help it make better decisions, there are a series of decision points where the engine has to go with its best guess. This part of query optimization is called "cardinality estimation," and it focuses on the selectivity of columns and join operations in a database.
In this two-part series, I'll tell you about some improvements Microsoft is making in SQL Server 2022 to improve cardinality estimates and deal with another common query optimization concern: parameter sniffing and suboptimal execution plans.
For most of its life, SQL Server has had a single cardinality estimation process. The cardinality estimator (CE) introduced in SQL Server 7.0 (1998) still exists in the product, but it was superseded by the version 120 cardinality estimator in SQL Server 2014.
The upgrade to the new CE in SQL Server 2014 benefited the vast majority of queries. However, for a very small subset of queries, performance regressed significantly. Microsoft addressed this by adding trace flags and, subsequently, database options to control the behavior of the CE. In SQL Server 2016, Microsoft also added the query store, which tracks query performance and execution plans over time. While the query store is a fantastic tool for monitoring query performance, it has also become the center of a number of features around automatic tuning in SQL Server (more on this below).
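To make those controls concrete, here is a minimal sketch (the database name and the AdventureWorks-style query are placeholders for illustration, not from the article): the database-scoped option reverts a single database to the pre-2014 CE without a server-wide trace flag, enabling the query store starts capturing plans and runtime statistics, and the USE HINT syntax scopes the legacy CE to one query.

```sql
-- Revert just this database to the legacy (pre-2014) CE -- the database-scoped
-- equivalent of the old trace flag.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Turn on the query store so plans and runtime statistics are captured over time.
ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;

-- Or scope the legacy CE to a single query with a hint (AdventureWorks-style
-- table name used for illustration).
SELECT sod.ProductID, COUNT(*) AS OrderLines
FROM Sales.SalesOrderDetail AS sod
GROUP BY sod.ProductID
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```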
Why Are There Cardinals in My Database?
Cardinality is defined as "the number of elements in a set or grouping, as a property of that grouping." When we translate this into database terminology, it means: Based on the tables you are selecting from and the predicate (your WHERE clause) and/or potential join operations, how many rows will be returned to you? This estimate drives a number of meaningful outputs -- how much memory is allocated to join or sort operations in your query, the type of join operations chosen, and the operators and overall shape of your query execution plan. The original CE made the following four assumptions as part of its algorithm:
- Independence: The distribution of data in different columns is independent of other columns, unless there is other data (namely multi-column statistics; see the sketch after this list) that indicates two columns have correlated values.
- Uniformity: Distinct values are evenly spaced within a column, and those values have the same frequency within each step of the column's statistics histogram (up to 200 steps per column).
- Containment (simple): Users will always query for data that exists. If you are joining two tables on a column, like ProductID, consider the selectivity of ProductID in each table to estimate the selectivity of the join.
- Inclusion: For WHERE clauses such as NAME='Joey', the value 'Joey' is assumed to exist in the NAME column.
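The independence assumption is the one you can most directly influence yourself: creating multi-column statistics tells the optimizer how two columns correlate. A minimal sketch, using hypothetical table and column names:

```sql
-- Hypothetical table and columns: City and PostalCode are strongly correlated,
-- so multi-column statistics give the optimizer density information about the
-- combination rather than forcing it to assume the columns are independent.
CREATE STATISTICS stat_City_PostalCode
ON dbo.Customers (City, PostalCode);

-- Inspect the density vector and histogram the optimizer will use.
DBCC SHOW_STATISTICS ('dbo.Customers', 'stat_City_PostalCode');
```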
The updates to the CE in SQL Server 2014 made the following changes:
- Independence becomes Correlation. The new CE no longer assumes that values in different columns are independent of each other, which is more representative of real-life data.
- Simple containment becomes Base Containment. Under this assumption, users may query for data that doesn't exist, which means the CE assumes that WHERE clause predicates on columns from two joined tables are not correlated with each other.
All of this is heady stuff. And as a DBA or developer, you don't really need to understand it to write effective queries that perform well. However, a little background here can help you better understand why the query optimizer makes the decisions it makes.
When SQL Server 2014 debuted, you didn't have a lot of options for how to fix the queries that slowed down under the new CE -- you had the big hammer of not using it, or hinting queries to try to get a specific execution plan. However, with the introduction of the query store, you could upgrade your SQL Server to a newer version and capture query performance while running under the old CE and an older compatibility level. As you upgraded to newer versions of both CE and compatibility level, the query store gave you the ability to force the use of the older execution plan to maintain consistent performance. SQL Server 2022 takes that ability a few steps further.
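Roughly, that upgrade workflow looks like the following sketch (the database name, query_id, and plan_id are placeholders you would look up in your own query store):

```sql
-- Move the database to the newer compatibility level (160 = SQL Server 2022).
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 160;

-- Look for queries whose plans regressed after the change...
SELECT q.query_id, p.plan_id, rs.avg_duration
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
ORDER BY rs.avg_duration DESC;

-- ...and force the plan that performed well under the old CE.
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 17;
```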
Last year, I wrote a column on the introduction of query store hints into Azure SQL Database. This functionality allows a DBA to add a query hint (i.e., a direct instruction telling the optimizer to bypass part of its normal decision-making and do a specific thing, like use a specified index in a query plan) without putting the hint into the SQL code base for an application. Traditionally, you would need to modify the code. Now, when the query store sees your query text and a hint exists for that query, the query is always executed with that hint.
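For example, assuming a query_id of 42 found in your query store (the id and the hint itself are purely illustrative), attaching and reviewing a hint looks roughly like this:

```sql
-- Attach a hint to the query without changing the application's SQL.
EXEC sys.sp_query_store_set_hints
     @query_id = 42,
     @query_hints = N'OPTION (RECOMPILE)';

-- Review hints currently in place, or remove one.
SELECT query_id, query_hint_text
FROM sys.query_store_query_hints;

EXEC sys.sp_query_store_clear_hints @query_id = 42;
```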
SQL Server 2022 introduces CE feedback, which uses the data collected in the query store to learn, over time, which CE model assumptions are valid for your queries. Then, using query store hints, it applies the appropriate set of model assumptions. After the feedback is verified through multiple executions of the query, it is persisted; if the model changes result in a performance regression, they are discarded. Because this data is stored in the query store, it is associated with the given query text and persists across reboots, failovers, etc.
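CE feedback rides on top of the query store, so you can both toggle it per database and inspect what it has learned. A sketch, assuming the query store is already enabled (sys.query_store_plan_feedback is new in SQL Server 2022; verify the column names against your build's documentation):

```sql
-- CE feedback is on by default at compatibility level 160; it can be toggled
-- per database with a scoped configuration.
ALTER DATABASE SCOPED CONFIGURATION SET CE_FEEDBACK = ON;

-- See which plans have received cardinality estimation feedback and its state.
SELECT f.plan_id, f.feature_desc, f.state_desc, f.feedback_data
FROM sys.query_store_plan_feedback AS f
WHERE f.feature_desc = N'CE Feedback';
```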
In a nutshell, database optimizers make a series of educated guesses to determine the best way to retrieve data for a given query. This process is very complex -- the stuff of graduate-level computer science courses. Microsoft changed the baseline set of assumptions it used to make those guesses in SQL Server 2014. While this proved problematic for a very small subset of queries, it helped the vast majority of them. The new CE feedback feature in SQL Server 2022 will allow the engine to make these decisions on an individual query level, which will improve the performance of even more queries on a server.
Stay tuned for Part 2 of this series, which will talk about parameter-sensitive query plans.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.