Once every few weeks or months, but always on the same weekday, one query from a stored procedure used in a daily batch job visibly becomes stuck and runs for about 200 minutes until the application kills its connection. It is a trivial textbook query which joins 5 tables and calculates a sum() on one field to assign to a variable, so there is no point in providing it here. The total number of records to be summed is normally one to two dozen. The query has no parameters, so parameter sniffing is (fortunately) out of the question.
I have looked at a number of things:
- There are no online users in the DB. It is used purely as a data dump.
- There is no job that runs simultaneously. The last activity on the DB is a transaction log backup that completes about 10 minutes earlier. The next activity, a full backup, starts about 2 hours later and completes fine after a few minutes (while this query is still going nuts).
- There are the following locks while this query is running, all in GRANT status:
  - an OBJECT lock on one of the tables, requested in Sch-S mode;
  - a HOBT.BULK_OPERATION lock on another table, in S mode;
  - an OBJECT lock on yet another table, requested in IX mode.
- The query has a (NOLOCK) hint on all joined tables. It is in a RUNNABLE state, and according to sp_whoisactive, which runs every 15 minutes on the server, the query's CPU time is constantly increasing, growing to about 1,000x that of a standalone execution. The IO reported by sp_whoisactive is about the same as when I run the query standalone, but the reads count is enormous, 10,000x over standalone execution. It feels like the query is in an endless loop. Yet when I get into the office in the morning, the job re-runs within a few seconds.
- The plan reported by sp_whoisactive is the same as the standalone plan. The optimizer suggests creating an index when I run the query, but since the standalone execution time is around 200 ms, I am not concerned about this hint; implementing it only seems to shave another 10 ms off that time.
- There are no errors reported by DBCC CHECKTABLE on any of the involved tables, or by DBCC CHECKDB.
The only point of suspicion I have about the plan is that it contains two parallelism operators, but adding option (maxdop 1) only increases its execution time from 200 to 400 ms.
There are 5 tables involved, all of different sizes; the largest is around 1 million records, not overly large. We refresh statistics and rebuild indexes daily at 4 am, and the job runs at midnight, before any large volumes of data are imported. The job itself only adds 1-10 records.
Forcing the execution plan captured from the hung query leads to the same result: a stuck query with huge CPU and IO. Letting the optimizer do its magic, or forcing the plan that I get standalone (they are the same), the query completes in a split second.
…
Long story short: I was allowed to run a nightly scheduled trace in production for the duration of the "normal" execution time of this job in order to troubleshoot this. We do not know when the next occurrence will be; we can only hope it will follow the pattern we have observed – every few weeks.
My million-dollar question is: what events should I trace for? I am only a SQL developer and do not have typical DBA skills, so I would not know which innards of SQL Server should be exposed through a trace to catch the culprit red-handed.
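For reference, this is the sort of Extended Events session I had sketched out as a starting point; the event and filter choices are my own guesses, and the session and file names are placeholders:

```sql
-- Sketch of an Extended Events session: capture statements running
-- longer than 60 seconds, plus any auto-stats activity around them.
-- Session name, file name, and the 60 s threshold are arbitrary.
CREATE EVENT SESSION [NightlyBatchTrace] ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.session_id)
    WHERE duration > 60000000   -- duration is measured in microseconds
),
ADD EVENT sqlserver.auto_stats (
    ACTION (sqlserver.database_name, sqlserver.sql_text)
)
ADD TARGET package0.event_file (SET filename = N'NightlyBatchTrace')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = OFF);

ALTER EVENT SESSION [NightlyBatchTrace] ON SERVER STATE = START;
```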
Best Answer
Based on your thorough outline, I believe you may be experiencing a problem with your table statistics.
Tables only update their statistics automatically when certain thresholds of row modifications are passed; for any table with more than 500 rows, 500 + 20% of the rows must change. For example, your million-row table requires 200,500 row changes before its statistics will update.
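You can see how close each table is to that threshold with `sys.dm_db_stats_properties` (available from SQL Server 2008 R2 SP2 / 2012 SP1 onward); this is a minimal sketch:

```sql
-- Rows modified since each statistic was last updated, compared
-- against the classic 500 + 20% auto-update threshold.
SELECT
    OBJECT_NAME(s.object_id)  AS table_name,
    s.name                    AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.modification_counter,
    500 + 0.20 * sp.rows      AS classic_threshold
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;
```

If `modification_counter` sits just under `classic_threshold` on the nights the query hangs, that would support the stale-statistics theory.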
An index REBUILD will update the statistics on a table (REORGANIZE does not).

I think your tables are growing over time and gradually invalidating your statistics, but not by enough to trigger an automatic update. However, your reindexing job is not triggering an actual REBUILD for the relevant tables until after your query has already started performing poorly due to the last few changes from earlier in the day. If your changes come in small, inconsistent increments, this would make the performance degradation look both consistent and sudden. Your reindexing job then fixes the problem after the query has run poorly, and everything looks fine the next morning.

Things to check to confirm:
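One thing worth verifying is whether your 4 am maintenance job issues a REBUILD or only a REORGANIZE for these tables; the table and index names below are placeholders:

```sql
-- REBUILD refreshes the index's statistics with a full scan.
ALTER INDEX IX_BigTable ON dbo.BigTable REBUILD;

-- REORGANIZE does NOT touch statistics; if your job uses it,
-- statistics must be updated explicitly afterwards.
ALTER INDEX IX_BigTable ON dbo.BigTable REORGANIZE;
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
```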
If you are running SQL Server 2008 R2 Service Pack 1 or later, you can enable documented trace flag 2371 to have statistics update dynamically and more frequently:
The new behaviour is enabled by default (i.e. without the trace flag) from SQL Server 2016 onward, for databases at compatibility level >= 130.
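A quick sketch of both options (the database name is a placeholder):

```sql
-- Pre-2016: enable trace flag 2371 globally so the auto-update
-- threshold shrinks as tables grow (or use the -T2371 startup flag).
DBCC TRACEON (2371, -1);

-- 2016+: the dynamic threshold applies automatically once the
-- database is at compatibility level 130 or higher.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'YourDb';

ALTER DATABASE [YourDb] SET COMPATIBILITY_LEVEL = 130;
```

Note that `DBCC TRACEON` does not survive an instance restart, so the startup parameter is the more durable choice.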