SQL Server Performance – Concerning Frequency of Hash/Sort Spills into TempDB

performance, sql server, tempdb

Our enterprise application uses SQL Server for data storage and is primarily an OLTP system. However, an important component of our application generates a significant OLAP workload.

Our write latency to tempdb is about 100ms. This trend holds over time, and ALLOW_SNAPSHOT_ISOLATION is turned off. We are troubleshooting this concerning problem and the only interesting thing we've found so far is that there are a significant number of hash and sort spills to tempdb. We surmise this is coming from our OLAP workload.

Question

What frequency of spills is concerning? Any? How many spills/sec? Our preliminary data indicates that we have about 2 hash spills per second and 25 sort spills per minute.

Is it possible that this frequency of spills could be a primary culprit in our high tempdb write latency?

Other information

We're using multiple tempdb files, with the file count chosen according to the usual per-core recommendation. The tempdb files are on a RAID 1+0 SAN (with high-performance SSDs), but that is the same device that holds the main DB data and log files. The tempdb files are sized large enough that they grow very infrequently. We aren't using trace flags 1117 or 1118. Another variable is that this setup is shared by a number of different databases that all experience medium to high load.

Our 100 ms write latency is much greater than the acceptable ranges for tempdb write latency that we have found on MSDN, SQL Skills, and other sites. However, write latency for our other databases is good (below 10ms). Based on other stats, it does appear we're using tempdb heavily, particularly for internal objects. So we're digging in to try to find out why our application is using internal objects so heavily.

We do have real performance issues on our platform that manifest in varying ways. We've been monitoring perf counters, looking at DMVs, and analyzing our app behavior to try to dig into the resource usage characteristics of our system. We're focused on spills right now because we've read that spills have a drastic negative impact, since they are performed on disk instead of in memory. We appear to have a very high number of spills, but I wanted to get some input on what people consider "high."

Best Answer

Is it possible that this frequency of spills could be a primary culprit in our high tempdb write latency?

Yes, it is possible, though typically the average size of the spills and how deep they go (i.e. recursive hash spills, multi-pass sorts) matter more than the frequency per se.
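To illustrate why size and depth dominate, here is a rough back-of-the-envelope sketch in Python. Only the two spill rates come from the question; the pages-per-spill figures and the `passes` multiplier are hypothetical, and treating a deeper spill as "the same data written several times" is a simplification, not how SQL Server accounts for it internally.

```python
# Rough estimate of tempdb write volume caused by spills.
# Only the two rates come from the question; every size is hypothetical.

PAGE_BYTES = 8 * 1024  # SQL Server data page size


def spill_write_mb_per_sec(spills_per_sec, avg_pages_per_spill, passes=1):
    """Approximate MB/s written to tempdb by one class of spill.

    `passes` is a crude stand-in for spill depth: a recursive hash spill
    or multi-pass sort writes the spilled data more than once.
    """
    bytes_per_sec = spills_per_sec * avg_pages_per_spill * passes * PAGE_BYTES
    return bytes_per_sec / (1024 * 1024)


hash_rate = 2.0      # hash spills per second (from the question)
sort_rate = 25 / 60  # 25 sort spills per minute (from the question)

# Same frequency, very different load depending on size and depth.
small = spill_write_mb_per_sec(hash_rate, avg_pages_per_spill=128)
large = spill_write_mb_per_sec(hash_rate, avg_pages_per_spill=12_800, passes=3)

print(f"small single-pass hash spills: {small:.1f} MB/s")
print(f"large three-pass hash spills:  {large:.1f} MB/s")
```

With identical spill counts, the two scenarios differ by orders of magnitude in write volume, which is why the spill rate alone cannot tell you whether spills explain the latency.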

SQL Server provides a wide range of metrics and DMV information to help you troubleshoot the various contributing factors to tempdb pressure, many of which are discussed in the Microsoft Technical Article, "Working with tempdb in SQL Server 2005" (applies to all versions 2005 onward).

You should be able to use the guidance and diagnostic queries contained in that document to start identifying the primary causes of any tempdb pressure. Do not disregard version store activity, for example, simply because ALLOW_SNAPSHOT_ISOLATION is not enabled. Many features besides snapshot isolation use the version store (e.g. triggers, MARS, RCSI).
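As a small illustration of that kind of breakdown, the sketch below converts the page counts exposed by sys.dm_db_file_space_usage into megabytes per consumer. It assumes you have already summed the columns over all tempdb files and copied the totals into a dictionary; the sample numbers are made up, and `tempdb_usage_mb` is a hypothetical helper, not part of any library.

```python
# Break tempdb space down by consumer, using page counts taken from
# sys.dm_db_file_space_usage (summed over all tempdb files).
# The sample values below are invented for illustration.

PAGE_KB = 8  # each page is 8 KB


def tempdb_usage_mb(row):
    """Convert reserved page counts to MB per consumer."""
    columns = {
        "version store": "version_store_reserved_page_count",
        "user objects": "user_object_reserved_page_count",
        "internal objects": "internal_object_reserved_page_count",
        "free": "unallocated_extent_page_count",
    }
    return {label: row[col] * PAGE_KB / 1024 for label, col in columns.items()}


sample = {
    "version_store_reserved_page_count": 256_000,
    "user_object_reserved_page_count": 64_000,
    "internal_object_reserved_page_count": 1_280_000,
    "unallocated_extent_page_count": 640_000,
}

usage = tempdb_usage_mb(sample)
for label, mb in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<17}{mb:>10.1f} MB")
```

A large internal objects figure points toward sorts, hashes and spools, while a non-trivial version store figure is worth chasing even with snapshot isolation turned off.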

If sort and hash spills do turn out to be significant at a high level, you will probably need to set up some specific monitoring for them. Depending a little on your SQL Server version, this is not always as straightforward as one might hope. Connecting sort and hash spills with the particular query that caused them requires Event Notifications or Extended Events. The SolidQ article, "Identifying and Solving Sort Warnings", contains details and some good general advice about resolving common causes.

You should also work with your storage team to determine how much of the high latency is attributable to your workload, how much comes from other shared uses, and what options there are for reconfiguration. Your analysis of SQL Server's metrics will help inform this discussion, as will any metrics the SAN people are able to provide.
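One metric worth bringing to that discussion is per-file write latency over a known interval rather than since startup. A minimal sketch, assuming you capture two snapshots of sys.dm_io_virtual_file_stats for each file and feed the num_of_writes and io_stall_write_ms values in by hand; the snapshot numbers here are invented to mirror the figures in the question, and `avg_write_latency_ms` is a hypothetical helper.

```python
# Average write latency for one file between two snapshots of
# sys.dm_io_virtual_file_stats. Both counters are cumulative since the
# instance started, so only the difference between snapshots is meaningful.
# The snapshot values below are invented for illustration.


def avg_write_latency_ms(before, after):
    """Return ms of stall per write over the interval, or None if no writes."""
    writes = after["num_of_writes"] - before["num_of_writes"]
    stall_ms = after["io_stall_write_ms"] - before["io_stall_write_ms"]
    return stall_ms / writes if writes > 0 else None


tempdb_before = {"num_of_writes": 1_000_000, "io_stall_write_ms": 40_000_000}
tempdb_after = {"num_of_writes": 1_006_000, "io_stall_write_ms": 40_600_000}

userdb_before = {"num_of_writes": 2_000_000, "io_stall_write_ms": 9_000_000}
userdb_after = {"num_of_writes": 2_012_000, "io_stall_write_ms": 9_096_000}

tempdb_ms = avg_write_latency_ms(tempdb_before, tempdb_after)
userdb_ms = avg_write_latency_ms(userdb_before, userdb_after)

print(f"tempdb file:  {tempdb_ms:.1f} ms per write")
print(f"user DB file: {userdb_ms:.1f} ms per write")
```

Comparing intervals with and without the reporting workload running gives you and the SAN team a shared basis for separating your own load from that of the other tenants.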