Joey on SQL Server

Understanding Columnstore Indexes in SQL Server

First introduced in SQL Server 2012, columnstore indexes can give you major performance gains -- provided you have the right workloads. Here's how columnstore indexes work and what types of data work best with them.

A major benefit of relational databases is that they allow for a high degree of selectivity. If you want to retrieve an outstanding balance for a given customer, in an optimized table with correct indexes, you can do that by accessing a single row in table.

I've spent a large portion of my DBA career trying to optimize indexes and the statistics about the data in them in an effort to get the database engine to generate an optimal query plan. The operation to return a page or a handful of pages in a database is called a seek in most database engines. It is very efficient because, ideally, the query reads only a handful of pages, as opposed to the scan operation, which reads all of the rows in a given index or table.

While database performance-tuning draws a lot of interest in both conference sessions and columns, there is a high amount of effort involved. Many organizations lack the in-house database skills to perform this level of tuning for their applications (this includes software vendors that frequently ship non-optimized databases).

SQL Server 2012 introduced a concept called columnstore indexes that was later added to all editions starting with SQL Server 2016 Service Pack 1. Traditional database indexes have a tree structure on disk (called a B-tree) that has a key value, which allows the database engine to quickly retrieve the row needed using a key. Columnstore indexes instead store each column of the table individually.

This architecture has a couple of benefits. Since the data is stored with other data in its column, there is a lot of commonality in the data, allowing for a much higher degree of compression and reducing the amount of disk I/O and memory required to provide query results. The other benefit is that only the columns referenced in a query need to be scanned in order to return query results.

Columnstore databases have been around for a long time. The first systems were designed the 1970s and became more broadly adopted as Big Data systems came into fruition in the early 2000s. Their focus has long been on online analytical processing (OLAP) systems such as data warehouses are "read mostly." While these tables are updated, most of the activity taking place against a data warehouse are read operations. Updating columnstore tables is a challenge.

In SQL Server, updates and inserts are processed initially into a B-tree table called a delta store. Eventually, a process called a tuple mover compresses the data into the columnstore structure. SQL Server uses a different process for bulk-loading data (the most common way OLAP systems ingest data) that uses memory to compress data directly into the columnstore.

A good example of an OLAP query is shown below:

-- Execute a typical query that joins the Fact Table with dimension tables
-- Note this query will run on the Page Compressed table, Note down the time
SELECT b.SalesTerritoryRegion
    ,FirstName + ' ' + LastName AS FullName
    ,count(SalesOrderNumber) AS NumSales
    ,sum(SalesAmount) AS TotalSalesAmt
    ,Avg(SalesAmount) AS AvgSalesAmt
    ,count(DISTINCT SalesOrderNumber) AS NumOrders
    ,count(DISTINCT ResellerKey) AS NumResellers
FROM FactResellerSalesXL_CCI a
INNER JOIN DimSalesTerritory b ON b.SalesTerritoryKey = a.SalesTerritoryKey
INNER JOIN DimEmployee d ON d.Employeekey = a.EmployeeKey
INNER JOIN DimDate c ON c.DateKey = a.OrderDateKey
WHERE b.SalesTerritoryKey = 3
    AND c.FullDateAlternateKey BETWEEN '1/1/2006' AND '1/1/2010'
GROUP BY b.SalesTerritoryRegion,d.EmployeeKey,d.FirstName,d.LastName,c.CalendarYear
GO

This sample query (against the AdventureWorks sample data warehouse) inherently reads all of the sales records in the FactResellerSalesXL_CCI table, which means a scan occurring against all 11 million records in the table and joins to a date dimension and employee and geography tables to provide a sales record.

The columnstore structure offers two major benefits here. There are only seven of the 27 columns in the table in this query, so those other columns do not need to be read. The other performance benefit of columnstore indexes in SQL Server provide is batch execution mode. This mode allows for the sums and averages (or other aggregations) in this query to be processed in approximately 900 row batches, as opposed to line by line (SQL Server 2019 adds batch mode to row-based tables).

The power of the compression and batch mode is evident. On a small Azure virtual machine, the above query executes 10 times faster against the columnstore table than the same table in a compressed row store. If you enable batch mode for the row-based table, the query is still seven times slower than the columnstore.

While Microsoft initially advocated for columnstore indexes to be used on large data warehouse tables, there is a case to be made for any table that's more than a few thousand rows and is not frequently updated. Inserts, while not ideal, are OK, especially for smaller tables that can be easily rebuilt.

The bigger question is: What do most of your queries do? Are you returning a single record from your table, or are you looking at most of the rows to perform an aggregation like a sum or an average?

select SalesAmount from FactResellerSalesXL_CCI where SalesOrderNumber = 'SO1190941'

The above query takes less than a millisecond against a row store but takes over two seconds against the columnstore index. That is a pretty big performance gap. However, if you only have a handful of queries that are returning a specific record and most of your queries are scanning the table, you could see tremendous benefits by moving these tables to a columnstore model. I would recommend doing this with tables that are very large or very static. If your data isn't changing, using a columnstore index makes a ton of sense.

The caveat to all performance-tuning advice is that you need to understand your workloads. If all of your queries are returning a single record, using a B-tree index is your best choice. However, a lot of workloads are aggregating records, and will benefit from both the compression and batch execution mode. Understanding columnstore indexes and where they can fit into your workload can give you major performance gains.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.

Featured

comments powered by Disqus

Subscribe on YouTube