SQL Server – A methodology for tracing a query that sporadically runs for hours instead of seconds

performance, query-performance, sql-server, sql-server-2012, trace

Once every few weeks or months, but always on the same weekday, one query from a stored procedure used in a daily batch job becomes visibly stuck and runs for about 200 minutes until the application kills its connection. It is a trivial textbook query that joins 5 tables and calculates a SUM() on one field to assign to a variable, so there is no point in providing it here. The total number of records to be summed is normally within 1-2 dozen. The query has no parameters, so parameter sniffing is (fortunately) out of the question.
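
For shape only, it is essentially of the following form (all table and column names here are invented):

    DECLARE @Total money;

    SELECT @Total = SUM(d.Amount)
    FROM dbo.BatchDetail AS d WITH (NOLOCK)       -- every joined table carries a NOLOCK hint
    JOIN dbo.BatchHeader AS h WITH (NOLOCK)
        ON h.BatchId = d.BatchId
    WHERE h.BatchDate = CAST(GETDATE() AS date);  -- the real query joins 5 tables in this manner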

I have looked at a number of things:

  • There are no online users in the DB. It is used purely as a data dump.
  • There is no job that runs simultaneously. The last activity on the DB is a transaction log backup that completes about 10 minutes earlier. The next activity, a full backup, starts after about 2 hours and completes fine within a few minutes (while this query is still going nuts).
  • The following locks are present while this query is running, all in GRANT status:

    • OBJECT lock on one of the tables requested in Sch-S mode
    • HOBT.BULK_OPERATION lock on another table in S mode
    • OBJECT lock on yet another table requested in IX mode

The query has a (NOLOCK) hint on all joined tables. It is in RUNNABLE state and, according to sp_whoisactive, which runs every 15 minutes on the server, its CPU time is constantly increasing, growing to about 1,000x that of a standalone execution. The physical IO reported by sp_whoisactive is about the same as for a standalone run, but the reads count is enormous, 10,000x over standalone execution. It feels like the query is in an endless loop. Yet when I get into the office in the morning, the job re-runs within a few seconds.
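
Since sp_whoisactive is already scheduled, each snapshot could also be persisted together with the live plan for later comparison. A sketch only: the log table name is invented and must be created beforehand (sp_whoisactive's @return_schema option can generate its DDL); @get_plans and @destination_table are standard sp_whoisactive parameters:

    EXEC dbo.sp_WhoIsActive
        @get_plans = 1,                               -- include the showplan XML of the running statement
        @get_additional_info = 1,
        @destination_table = 'dbo.WhoIsActive_Log';   -- hypothetical log table, created beforehand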

  • The plan reported by sp_whoisactive is the same as the standalone one. When I run the query myself, the optimizer suggests creating an index, but since the standalone execution time is around 200 ms, that suggestion does not concern me; implementing it seems to shave only another 10 ms off.

  • There are no errors reported by DBCC CHECKTABLE on any of the involved tables or DBCC CHECKDB.

The only point of suspicion I have about the plan is that it contains two parallelism icons. But adding OPTION (MAXDOP 1) only increases its execution time from 200 ms to 400 ms.

There are 5 tables involved, all of different sizes; the largest has around 10^6 records, which is not overly large. We refresh statistics and indexes daily at 4 am, and the job runs at midnight, before any large volumes of data are imported. The job itself only adds 1-10 records.

Forcing the execution plan stored for the hung query leads to the same result: a stuck query with huge CPU and IO. Letting the optimizer do its magic, or forcing the plan that I get standalone (they are the same), makes the query complete in a split second.

Long story short: I was allowed to run a nightly scheduled trace in production for the duration of the "normal" execution window of this job, in order to try to troubleshoot this. We do not know when the next occurrence will be; we can only hope it follows the pattern we have observed: every few weeks.

My million-dollar question is: what events should I trace for? I am only a SQL developer and do not have typical DBA skills, so I would not know which innards of SQL Server should be exposed through a trace to catch the culprit red-handed.
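
For concreteness, this is roughly the kind of Extended Events session I imagine scheduling for that window. A sketch only: the session name, 60-second threshold, and file path are placeholders, not anything we run today (note that duration is reported in microseconds):

    CREATE EVENT SESSION [long_batch_trace] ON SERVER
    ADD EVENT sqlserver.sql_statement_completed (
        ACTION (sqlserver.sql_text, sqlserver.session_id)
        WHERE (duration > 60000000))                       -- statements that ran longer than 60 s
    ADD TARGET package0.event_file (SET filename = N'C:\Traces\long_batch_trace.xel');

    ALTER EVENT SESSION [long_batch_trace] ON SERVER STATE = START;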

Best Answer

Based on your thorough outline, I believe you may be experiencing a problem with your table statistics.

Tables only automatically update their statistics when certain thresholds of row changes are passed; for any table with more than 500 rows, 500 + 20% of the rows must change. For example, your million-row table requires 200,500 row changes (500 + 200,000) before its statistics will update.
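
If you are on SQL Server 2008 R2 SP2 or 2012 SP1 and later, you can watch how close each statistic is to that threshold. A sketch; replace SchemaName.TableName with one of the five tables:

    SELECT s.name                 AS stats_name,
           sp.last_updated,
           sp.rows,
           sp.modification_counter               -- row changes since the statistics were last updated
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('SchemaName.TableName');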

An index REBUILD will update the statistics on a table (REORGANIZE does not).
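
For illustration (the table name is a placeholder):

    ALTER INDEX ALL ON dbo.TableName REBUILD;       -- rebuilds the indexes and refreshes their statistics
    ALTER INDEX ALL ON dbo.TableName REORGANIZE;    -- defragments only; statistics are left untouched
    UPDATE STATISTICS dbo.TableName WITH FULLSCAN;  -- explicit refresh, e.g. to pair with a REORGANIZE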

I think your tables are growing over time and eventually invalidating your statistics, but not by enough to trigger an automatic update. Meanwhile, your reindexing job does not trigger an actual REBUILD for the relevant tables until after your query has already started performing poorly because of the last few changes from earlier in the day. If your changes arrive in small, inconsistent increments, this would make the performance degradation both sudden and consistent in its pattern. Your reindexing job then fixes the problem after the query has run poorly, and everything looks fine the next morning.

Things to check to confirm:

  • If you can catch a 'bad' query plan, check the Actual versus Estimated number of rows on the arrows between operators. If these are radically different, it's a red flag for statistics issues (see the sketch after this list).
  • Check the criteria on your reindex job to confirm that it is not actually rebuilding the indexes on the query's tables regularly.
  • If you run the following query regularly, you can track when statistics were last updated:

    SELECT name AS index_name,
           STATS_DATE(object_id, index_id) AS stats_updated
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('SchemaName.TableName');
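
One way to capture those Actual versus Estimated counts, assuming you can re-run the statement while the problem is visible (the procedure name is hypothetical):

    SET STATISTICS XML ON;    -- returns the actual plan, with Actual vs Estimated rows per operator
    EXEC dbo.DailyBatchStep;  -- hypothetical: the statement or procedure under suspicion
    SET STATISTICS XML OFF;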
    

If you are running SQL Server 2008 R2 Service Pack 1 or later, you have the option to enable documented trace flag 2371 to dynamically update statistics more frequently:

The higher the number of rows in a table, the lower the threshold will become to trigger an update of the statistics. For example, if the trace flag is activated, update statistics will be triggered on a table with 1 billion rows when 1 million changes occur. If the trace flag is not activated, then the same table with 1 billion records would need 200 million changes before an update statistics is triggered.

The new behaviour is enabled by default (i.e. without the trace flag) from SQL Server 2016 onward, for databases at compatibility level >= 130.
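
For completeness, a sketch of enabling the flag globally on an older instance (requires sysadmin; DBCC TRACEON does not persist across restarts, so the -T2371 startup parameter is the durable option):

    DBCC TRACEON (2371, -1);   -- -1 enables the flag instance-wide
    DBCC TRACESTATUS (2371);   -- confirm it is active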