SQL Speed Secrets
Wait-time analysis can help improve performance by focusing on how long applications take to respond to queries.
Have you ever had the feeling that your application was slowed down by SQL Server, but didn't know how to fix the problem?
SQL Server performance management is usually reactive and focused on server health. Database administrators (DBAs) respond to trouble rather than avoiding it in the first place. And visibility is largely restricted to watching the database server, rather than understanding how SQL Server directly affects application users.
Wait-time analysis, a method of improving the service and performance of SQL Server databases, changes all this. Rather than monitoring system health, wait-time analysis focuses on the time the application spends waiting on queries being processed by SQL Server. The result is an analysis technique that can quickly answer the key questions: Why is my database causing application users to wait, and what can be done?
The wait-time approach to analysis is now practical due to lightweight monitoring techniques and agentless architectures. It takes advantage of new instrumentation in SQL Server to expose wait types, the individual steps that accumulate delays as SQL Server processes queries.
Do More with Less
For the IT organization, the results of using wait-time analysis are reduced cost of database operation and improved IT service. DBAs can do more with fewer servers. Migrations from SQL Server 2000 to 2005 to 2008 become quicker, and development cycles are shortened. For IT groups tasked with providing better service with fewer resources, wait-time analysis is a cost-effective answer.
DBAs are often in a tough spot. They're accountable for database response to application users, but they have no visibility into why the database is slow. Often the issue is not in their database at all, but stems from the application code, the network or the system architecture. To get application code changed, DBAs must bring evidence to developers, who meanwhile are suspicious, because to them, the database is a poorly understood black box. "Just get a faster server!" developers say. Wrong.
Those problems are a symptom of relying on old server health-monitoring techniques to truly understand what's happening inside SQL Server.
Wait-Time Analysis Explained
Effective wait-time analysis is more complex than just looking at a snapshot of wait-type data. To be effective in generating usable information from the mass of obscure data points, it must take advantage of techniques proven in business intelligence scenarios. Key concepts include:
- Measure Time, Don't Count Operations. For the application user, the number of I/O operations or logical reads means nothing. All that counts is how long the app takes to respond. To optimize for this user perspective, focus on time taken in the database. Wait types are a method of doing this.
- Focus on Queries. The key is measuring at the level of SQL queries and individual sessions. Tools that measure wait across an entire instance or database without breaking it down further do not give actionable information.
- Continuous Capture. Keep your eyes open all the time. By watching all sessions, all of the time, the DBA can capture the occurrence of any problem. When a user calls asking for help on a slow application, the data must be already available. Systems that depend on tracing intermittently will miss problems when they occur.
- Historical View. To know what to fix, DBAs must look at trends and changes in the database, not just instantaneous results. Effective wait-time analysis takes a historical view to compare current wait-type statistics with past statistics in order to see what's different that could be the source of a new problem.
SQL Server Wait Types
Awareness of SQL Server wait types is the first step in understanding the method. Any statement running against a SQL Server will experience some form of wait as SQL Server accesses resources in order for the statement to complete. A request will wait for data to be retrieved, written to disk or for an entry to be written to the SQL Server log. You'll notice when watching an instance closely that it experiences a number of waits throughout a given time period. When waits become chronic or excessive, you may begin to see a performance problem.
Common Wait Types
SQL Server records information about the type and duration of the waits that a process experiences. While there are more than 100 different wait types in SQL Server, you will likely only ever encounter a handful of these as problems. Any wait type beginning with "LCK_" means that a task was waiting to acquire a lock. For example, a wait type of LCK_M_IX means the process was waiting to acquire an Intent Exclusive lock. More than 20 of the wait types are lock waits, which is fitting because most work being performed in SQL Server requires some sort of lock. The next most common lock types are ASYNC_IO_COMPLETION and ASYNC_NETWORK_IO. The first means a process was waiting for an I/O operation to complete. The second means that a task is waiting for I/O to complete over the network. Finally, keep an eye out for the CXPACKET wait state. This occurs when a process is trying to synchronize the query processor exchange iterator. This can indicate an issue with a server's parallelism setting. Spending time figuring out what all the potential wait states are can be time consuming. On average, about 20 of the potential wait states show up in 80 percent of problems. After doing wait-time analysis for a while, you'll get used to seeing certain wait types, including the ones looked at here.
Capturing Wait-Type Data
SQL Server has offered views of wait types for quite some time now, but unfortunately, those views have been vague and -- for the most part -- unhelpful. Starting in SQL Server 7.0 and 2000, DBAs could use Enterprise Manager (EM) to view wait types. The problem was that all EM provided was the name of the wait type and the length of time a given process had been waiting. When SQL Server Management Studio was introduced with SQL Server 2005, the views of active queries and sessions remained similar. Again, DBAs were given a wait type and duration, but not much else. The bottom line is that wait states that your processes are currently experiencing just aren't that helpful, and that's all you can really get using the standard tools.
Currently, one of the best ways to look at wait statistics in SQL Server is by using the dynamic management views (DMVs) that pertain to wait statistics. If you're still running on SQL Server 2000 or older, you're out of luck because DMVs were new to SQL Server 2005. The most pertinent DMVs for looking at wait statistics are sys.dm_exec_requests, sys.dm_exec_query_stats and sys.dm_os_wait_stats. (Note that the DMVs provide a snapshot of the counters, so to make them useful, you need to poll and calculate deltas.)
- sys.dm_exec_requests: This DMV offers information about each request that's an execution on a given SQL Server. When looking at wait states, you care about only a few of the columns that this view provides; specifically sql_handle, wait_type, wait_time, last_wait_type and wait_resource. These columns provide information about the statement being executed and the request's current wait state.
- sys.dm_exec_query_stats: This view returns aggregate performance statistics for cached queries. By using the sql_handle detail from sys.dm_exec_requests to join to a row in this view, you can start to get a picture of how often the waits you see might be occurring. Keep in mind that this view doesn't give more wait detail -- everything here is just an aggregated statistic for a given sql_handle.
- sys.dm_os_wait_stats: This view provides an aggregate picture of all wait states on a SQL Server. It provides a list of all the different waits states and detail about tasks in that state, including how many tasks are waiting in each state, the total wait time for the state and the average wait time. This detail is good for a big picture, or to get a quick idea of the types of waits occurring, but most of the real diagnostics and tuning will occur at a statement level.
In order to understand how wait-time analysis can help DBAs accomplish everyday problem resolution, here are a few scenarios to consider.
Scenario 1: Identifying the Problem Query
One of the most frustrating problems a DBA faces is the "problem query" (see Figure 1). Often, this is a query that a developer has identified as a particularly slow-running query. DBAs will usually hear that the query "ran fine in development" or "has been running fine for several weeks." Other times, repeated complaints of performance problems will lead DBAs to begin looking for the problem query in an attempt to increase performance.
[Click on image for larger view.]
|Figure 1. Example of a problem SQL query "Get State" exposed with excessive wait time.
In either case, the traditional methods of researching the problem usually involve opening several tools, such as SQL Server Profiler and Windows Performance Monitor, and trying to capture real-time problems. Specifically, most DBAs are looking at the queries that have high durations, high numbers of reads and/or writes, and queries that are being rerun frequently.
In all of the cases, however, the base numbers can be misleading. For example, queries that are being rerun frequently but very quickly may or may not be causing a bottleneck. If the base query runs quickly and efficiently, with very low wait times, there probably isn't a problem. If, however, the given query is constantly experiencing the same wait type, such as ASYNC_IO_COMPLETION, there may be a bottleneck. Determining the difference is what wait-type analysis is all about.
Scenario 2: Resolving Locking Problems
SQL Server locking is often a very confusing subject. However, using wait-type analysis, figuring out what locks are being acquired and how those locks may be blocking other processes is much easier.
Throughout the day, most SQL Servers will experience split-second locking and blocking conditions. Only when these locks result in long-term blocking is there a problem. Wait types that list locking types, such as LCK_M_SCH_M (which is a schema-modification lock), identify exactly what the process is waiting for (see Figure 2). In the case illustrated in Figure 2, a process waiting for the lock needs to actually modify the schema of the table or view, and therefore has to wait for any preceding processes that are inserting, updating or deleting data to finish.
Another potential problem is the natural extension of a single blocking process: the blocking chain. Once one process is waiting for a resource and is blocking another process, it's very likely that another process will end up waiting for the process that's waiting for the original process, and so on. The resolution to this is to find the "head" of the chain. Once the wait type of the head of the chain has been identified and resolved, the rest of the blocking chain should be freed up.
[Click on image for larger view.]
|Figure 2. LCK_M_U wait, shown by the blue bar, causes the most wait time for Get State.
Scenario 3: Finding Hardware Bottlenecks
Identifying hardware resource bottlenecks may be the most complicated scenario. While there are a number of symptoms that can point to a bottleneck, there's almost no other way to identify a hardware problem other than using wait-type analysis.
In this case, the key is to look for wait types related to either the disk subsystem (such as the PAGELATCHIO_* wait types), the CPU (CXPACKET, for example) or the general memory system (RESOURCE_* wait types). These wait types, when experienced for more than a few seconds, generally point to hardware problems.
For example, assume there's a query that usually runs for about 20 minutes and uses three table joins to determine the updates for a fourth table. The developer has provided feedback that the query has started randomly taking more than of four hours; there's no discernable pattern to when the query runs fast versus when it runs slow. A DBA can identify what wait type is occurring most frequently for that query, and what the duration is for each wait type during its run. If the wait type falls into one of the hardware-related categories, it's time to look at other queries on the system that are experiencing greater-than-expected durations in similar wait types.
Joshua Jones is a database systems consultant with Consortio Services LLC in Colorado Springs, Colo. He provides training, administration, analysis and design support for customers utilizing SQL Server 2000, 2005 and 2008. Jones speaks at numerous events about SQL Server topics, and is co-author of "A Developer's Guide to Data Modeling for SQL Server" (Addison-Wesley, 2008).
Don Bergal is the COO at Confio Software in Boulder, Colo. For the past five years he and his team have helped customers improve the performance of thousands of databases, as well as developed the Ignite Performance Intelligence methods of wait-time analysis.