Sql-server – Query turned into a CPU gobbling monster

performance, query-performance, sql-server, sql-server-2014

I have a stored procedure whose execution time went from 5 minutes to 20 minutes to 30 minutes to 53 minutes over four days.

Waits were showing increasing CPU and suspended status

I isolated a single query that pegs the CPU

UPDATE thing.table
SET YYYYMM = 
CASE 
  WHEN 
    DAY(SnapshotDate) = 1
    OR 
  SnapshotDate = (SELECT MAX(SnapshotDate) FROM thing.table) 
  THEN CAST(FORMAT(DATEADD(day,-1,snapshotdate),'yyyyMM') AS INT)
  ELSE NULL
END

I ran it again, adding WITH (RECOMPILE) at the end – no difference

I ran UPDATE STATISTICS thing.table – no difference

It would be interesting to run it and get the actual plan, but I don't want to peg the CPU for an hour. I checked sys.dm_exec_cached_plans, but it appears to have only the estimated plan, not the actual plan.

I rewrote using CONVERT instead of FORMAT (because I am suspicious of new things) – no difference

So I rewrote like this and took execution back down to a few seconds:

BEGIN TRAN;

UPDATE thing.table
SET YYYYMM = NULL;

UPDATE thing.table
SET YYYYMM = CAST(FORMAT(DATEADD(day,-1,snapshotdate),'yyyyMM') AS INT)
WHERE 
(
DAY(SnapshotDate) = 1 
OR
SnapshotDate = (SELECT MAX(SnapshotDate) FROM thing.table) 
);

COMMIT TRAN;

The table has about 150,000 records in it. It's quite possible that it recently had a lot more records dumped in it, skewing statistics, but why would WITH (RECOMPILE) and UPDATE STATISTICS not fix that? It takes a daily snapshot, and the number of records possibly increased due to end of month.

So the questions are:

Is the actual query plan stored anywhere, to save me running it interactively?
Normally when a query suddenly takes forever it's stale statistics, but that doesn't seem to be the case here.

This is my version of SQL Server

Microsoft SQL Server 2014 – 12.0.4100.1 (X64) Apr 20 2015 17:29:27 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)

Here are the slow and fast query plans. No surprise they are different, because they are doing different things:

Slow Plan:

Fast Plan:

I notice slowpoke uses a loop join and fasty uses a hash match.

I notice the small leg of the loop join has a filter: [Expr1006]=DB.thing.table.[SnapshotDate]. Maybe that wasn't so small anymore?

Best Answer

The first query is so slow because it does a full table scan on thing.table for every row in that table for which DAY(SnapshotDate) <> 1. So if you have 100k rows in the table, in the worst case you'll do 100k scans, which means reading through 10 billion rows. If the table is small enough it'll stay in memory, so your parallel query will appear to burn through CPU.

You can tell by looking at the query plan carefully. The scan is on the inner side of a nested loop join. If that's not your cup of tea you can try live query statistics to see the query as it executes. That way you can get some of the information from the actual plan without needing the query to finish. There's no way to save off old actual plans without setting up extended events.
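As a rough sketch of watching the query while it runs: SQL Server 2014 exposes per-operator progress in the sys.dm_exec_query_profiles DMV, but only for sessions that turned on actual-plan collection before starting the statement. The session id below is a placeholder and not something taken from the question.

```sql
-- Session 1 (the one that will run the slow statement):
-- enable actual-plan collection BEFORE starting the UPDATE.
SET STATISTICS XML ON;
-- ... run the slow UPDATE here ...

-- Session 2: poll per-operator row counts while session 1 is still running.
SELECT node_id,
       physical_operator_name,
       SUM(row_count)          AS actual_rows_so_far,
       SUM(estimate_row_count) AS estimated_rows
FROM sys.dm_exec_query_profiles
WHERE session_id = 53   -- placeholder: use session 1's @@SPID
GROUP BY node_id, physical_operator_name
ORDER BY node_id;
```

If the scan on the inner side of the nested loop is the problem, its actual row count should climb far past the estimate within the first few seconds, and the statement can then be cancelled.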

The second query is faster because the query optimizer is freer to rearrange the elements of the query due to the lack of the CASE expression. Instead of calculating the subquery for MAX(SnapshotDate) once per row the calculation is done once per query.

You'll definitely want to fix this query in some way or the execution time will continue to grow quadratically with the number of rows in the table. One workaround would be to add an index to the SnapshotDate column. The subquery will still execute once for each row but getting the maximum value will be a very cheap operation. A better way is to save off the value of the subquery to a local variable and to use that in your UPDATE query. Unless you have to worry about concurrency that shouldn't be an issue.
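A minimal sketch of both workarounds, using the question's placeholder name thing.table (substitute the real table name). The index name and variable name are made up for illustration, and the variable is declared as date on the assumption that SnapshotDate is a date column; it should match the column's actual type.

```sql
-- Workaround 1: an index on SnapshotDate makes each MAX() lookup cheap.
-- IX_table_SnapshotDate is an illustrative name.
CREATE INDEX IX_table_SnapshotDate
    ON thing.table (SnapshotDate);

-- Workaround 2: compute the maximum once and reuse it.
-- Assumes SnapshotDate is of type date; adjust the declaration if not.
DECLARE @MaxSnapshotDate date;

SELECT @MaxSnapshotDate = MAX(SnapshotDate)
FROM thing.table;

UPDATE thing.table
SET YYYYMM =
    CASE
        WHEN DAY(SnapshotDate) = 1
          OR SnapshotDate = @MaxSnapshotDate
        THEN CAST(FORMAT(DATEADD(day, -1, SnapshotDate), 'yyyyMM') AS INT)
        ELSE NULL
    END;
```

With the variable, the table is read once to find the maximum and once for the update, instead of once per row. If rows can be inserted between the two statements, the stored maximum may be stale, which is the concurrency caveat mentioned above.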

You can also stick to the fix that you found if you want. One suggestion that can help in some cases (depending on the table structure) would be to add a where clause to your first UPDATE:

UPDATE thing.table
SET YYYYMM = NULL
WHERE YYYYMM IS NOT NULL;

Related Solutions

Sql-server – Same Query takes 0 seconds on Server A, 7.5 minutes on Server B (same db/hardware/config)

This problem has been resolved, although not quite how I anticipated.

Performance did not change after flushing the buffer pool, the procedure cache, rebooting the server, or rebuilding the indexes. The lengthy index spool continued to appear sometimes and not others in the query plan.

The fix was to change the reference in the where clause from ib_charge.patient_account_fk = ### to ib_header.patient_account_fk (the itemized bill (IB) header table is much smaller than the ib_charge table). This resulted in the server ceasing to use the Index Spool in all cases, which was the cause of the performance hit.

Sql-server – Cannot tune database any further; what next

Fulltext isn't going to help without refactoring to use the full-text functions (CONTAINS, FREETEXT or their table equivalents). It also doesn't really work with a leading wildcard. Hacks are available, but basically you're going to struggle to write a semantically equivalent query for fulltext. For the future, consider redesigning for fulltext, which has stemming (run, runner, running) and a thesaurus (jogger) and could serve your searches much better than two wildcards.

SSD is unlikely to help you unless you are memory bound. Your tables (at only 500k records) are probably in-memory most of the time. Can you confirm the size of the dJobs table, and server RAM?

Enterprise Edition could help where the limitation of 64GB RAM / lesser of 4 sockets or 16 cores goes up to 8, but you're going to need a really powerful box to notice a difference. For example, the 4 really means you could have something like 4 quad-core processors totalling 16 cores; with HT enabled, you're already at 32 logical processors. The general recommended server maxdop for this type of OLTP machine would be 8 anyway. I think this is unlikely to benefit you because your query has more fundamental problems, but you never know.

Non-clustered indexes (particularly on dJobs) are unlikely to help because the query has so many columns from this table in the SELECT and many criteria in the WHERE clause. A non-clustered index wide enough to cover the query would be practically a duplicate of the clustered index, and therefore overly expensive to maintain. As the query sorts by jobID DESC, I considered a descending index but haven't trialled this.

Partitioning (Enterprise only) is really a great feature, but again is unlikely to help you. I did a quick investigation of partitioning on the dbo.dJobs.jobJobStatus column, since I imagine you only have a small percentage of Jobs 'active' at any one time, e.g. a few hundred or even a few thousand from the 500,000 records. Partition elimination would probably be cancelled out by the OR OR OR approach. Parallel scans of multiple partitions are also an Enterprise feature.

This would probably work:

SELECT TOP 20 *
FROM dJobs
    LEFT JOIN dClients on cltClientID = jobClientId
    LEFT JOIN dUsers on regUserId = jobCoordinator
    LEFT JOIN dJobStatus ON jbsID = jobJobStatus
WHERE
    (
    jobjobStatus IN ( SELECT jbsid FROM djobstatusgroupmapping WHERE jsgid = 0 )
    )
ORDER BY jobID DESC

This probably won't work:

SELECT TOP 20 *
FROM dJobs
    LEFT JOIN dClients on cltClientID = jobClientId
    LEFT JOIN dUsers on regUserId = jobCoordinator
    LEFT JOIN dJobStatus ON jbsID = jobJobStatus
WHERE
    (
    jobjobStatus IN ( SELECT jbsid FROM djobstatusgroupmapping WHERE jsgid = 0 )
    OR ( 0=0 ) OR ( 0=0 )   --<< this 'OR always true' means 'get the whole table'
    )
ORDER BY jobID DESC

This leads me into the query. The OR OR OR approach basically means 'always get the whole table'. The TOP 20 masks this design problem. The TOP also probably pushed the plan towards Nested Loops, which Jon suggested was suspect. What also stands out to me about this nightmarish "scan all columns" constructed query is that you basically have two copies of the same query (and therefore tables), one to count and one for the resultset. This might be more efficient if the data went into an intermediate table and the count was done from there, for example.

Finally, this brings me to the only thing that would actually help you (without a large-scale refactor of the code): data deletion or archiving. As mentioned, I imagine you only have a small percentage of Jobs 'active' at any one time. Carve off the 'inactive' ones into a different table. Create a view over the top of the two tables for reporting. Set up a nightly job to copy out the old records.

Having only a few thousand active jobs in your main table will most likely transform your query performance.

Some recommended reading:

Erland Sommarskog's article on these "search all columns" queries Dynamic Search Conditions in T-SQL http://www.sommarskog.se/dyn-search-2008.html

Querying Multiple Columns (Full-Text Search) http://technet.microsoft.com/en-us/library/ms142488(v=sql.105).aspx

I hope that helps!
