After some excellent help from Kin, wBob, and Aaron Bertrand, I finally found my root cause: Table1 statistics needed updating.
I had already confirmed that statistics were updated two days prior during our weekly index maintenance, so I didn't think much of it. However, I gave it a try, and it took the query run time from 2 hours 30 minutes down to 2 seconds!
The command run was:
UPDATE STATISTICS Table1;
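For anyone who wants to verify how stale statistics actually are before updating them, here is a minimal sketch using the standard DMV sys.dm_db_stats_properties (Table1 and the dbo schema are assumptions based on this post):

SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter -- changes to the leading column since the last update
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Table1');

A high modification_counter relative to rows suggests the statistics no longer reflect the data, even when last_updated looks recent.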
Thanks for all the help, particularly the pointers in the article "The 'Runaway' Query".
I'd like to know why the optimizer does not use the clustered index but uses the non-clustered one instead.
This will be a decision of the cost-based optimizer. It estimates that it is cheaper to fully scan the narrow non-clustered index. It seems that you were expecting a nested loops join with seeks on the clustered index?
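If you want to compare the plan the optimizer picked against the one you expected, you could (purely as an experiment, not as a production fix) force seeks on the target table with a table hint. The table names and aliases here are taken from the query discussed below:

-- Experimental only: force seek access on Production_Detail
DELETE pd
FROM #ToPurge tp
     INNER JOIN [DataWarehouse].[dbo].[Production_Detail] pd WITH (FORCESEEK)
         ON tp.ID = pd.ID;

Comparing the estimated costs of the hinted and unhinted plans shows why the optimizer preferred the scan.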
The execution plan shows that the table #ToPurge has 3.9 million rows. This will take 195 batches to clear out. Assuming that all rows in #ToPurge are unique and initially exist in Production_Detail, and that the rows output from #ToPurge are in a consistent order for each batch (I'm ignoring the parallelism), then each successive batch will end up seeking all rows that have already been deleted in previous batches before finally getting to a patch of rows that have not yet been processed. In total, your delete process will do more than a third of a billion index seeks for already deleted rows.
0 + 20,000 + 40,000 + ... + 3,880,000 = 20,000 × (0 + 1 + ... + 194) = 20,000 × (194 × 195 / 2) = 378,300,000
This isn't the logic that the cost-based optimizer uses (which will involve the histograms of the two tables and a row goal for the TOP 20000), but it shows that this is not a good strategy anyway.
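As an aside, if you want to see what plan you would get without the row goal, SQL Server 2016 SP1 and later let you disable it with a query hint. The statement below is only my guess at the shape of your original batched delete, for illustration:

-- Hypothetical reconstruction of the original delete, with the row goal disabled
DELETE TOP (20000) pd
FROM [DataWarehouse].[dbo].[Production_Detail] pd
     INNER JOIN #ToPurge tp
         ON tp.ID = pd.ID
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));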
You could try the following to cap the number of seeks at 20,000 per iteration and see if you now get your desired plan.
CREATE TABLE #ToPurge
(
    [BatchId] [INT] NULL,
    [Id] [BIGINT] NOT NULL,
    /*Clustering on (BatchId, Id) lets each batch be read with a single range seek*/
    UNIQUE CLUSTERED([BatchId], [Id])
);
/*Your insert statement to #ToPurge here*/
/*Number the rows and assign a batch id so that each batch gets exactly 20,000 rows*/
WITH CTE
AS (SELECT *,
           1 + ( ( ROW_NUMBER() OVER (ORDER BY Id) - 1 ) / 20000 ) AS _BatchId
    FROM #ToPurge)
UPDATE CTE
SET [BatchId] = [_BatchId];
DECLARE @BatchCounter INT = 1;

WHILE @BatchCounter <= (SELECT MAX([BatchId]) FROM #ToPurge)
BEGIN
    DELETE pd
    FROM #ToPurge tp
         INNER JOIN [DataWarehouse].[dbo].[Production_Detail] pd
             ON tp.ID = pd.ID
    WHERE tp.BatchId = @BatchCounter;

    SET @BatchCounter += 1;
END;
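One small tweak worth considering (my addition, not part of the suggestion above): the WHILE condition re-scans #ToPurge for MAX(BatchId) on every iteration. Since the table is not modified inside the loop, you could capture it once:

/*Capture the batch count once instead of re-evaluating it every iteration*/
DECLARE @MaxBatchId INT = (SELECT MAX([BatchId]) FROM #ToPurge);
DECLARE @BatchCounter INT = 1;

WHILE @BatchCounter <= @MaxBatchId
BEGIN
    DELETE pd
    FROM #ToPurge tp
         INNER JOIN [DataWarehouse].[dbo].[Production_Detail] pd
             ON tp.ID = pd.ID
    WHERE tp.BatchId = @BatchCounter;

    SET @BatchCounter += 1;
END;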
Best Answer
I can think of two possible explanations for why you experience poor performance with that query.
The first relates to blocking. Have you checked the wait events while the query is running? If it is blocked by another session and waiting for a lock, then you don't have a performance problem. Instead, you have a concurrency problem.
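A quick way to check this while the delete is running is a sketch like the following, using the standard sys.dm_exec_requests DMV (replace the session id with the one running your DELETE):

-- What is session 57 (hypothetical spid) waiting on, and who is blocking it?
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time,          -- ms spent in the current wait
       r.blocking_session_id -- non-zero means another session holds the lock
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 57;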
The second relates to the table and index structure of the table that you are deleting from. The PriceDiffs table has 194,807,000 rows. If you don't have an index that can seek on the UserID column, it could take a long time to find the single row to delete. There's also a scan on the PriceDifferenceRecords table, which I assume is there to enforce a constraint or a foreign key, but that table only has 15,843 rows.

You might be wondering why the query takes two minutes to run when it has a relatively low estimated subtree cost of 0.0569578. This is probably due to the row goal introduced to the query by your use of TOP 1. SQL Server scans the IX_PriceDiffs_PriceDate index to find a matching row, but the filter is implemented as a predicate instead of a seek predicate. I assume this is because UserID is not a key column. Thus, SQL Server scans the index in order until it finds a row where UserID = 384166. If you are really lucky, the first matching row could be in the first page of the index. If you are really unlucky, the first matching row could be in the last page of the index, so your DELETE query would scan the entire table to delete just one row.

It's worth noting that deleting just one row from a table often isn't a good way to test performance when you need to delete lots of rows. As a starting point, consider creating an index on UserID and deleting in batches.
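To make that concrete, here is a minimal sketch of both suggestions. The index name is my invention, and I'm assuming the table lives in dbo and that the goal is to purge all rows for one user:

-- Hypothetical index to support seeks on UserID
CREATE NONCLUSTERED INDEX IX_PriceDiffs_UserID
    ON dbo.PriceDiffs (UserID);

-- Delete in batches so each statement touches a bounded number of rows
DECLARE @RowsDeleted INT = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (5000)
    FROM dbo.PriceDiffs
    WHERE UserID = 384166;

    SET @RowsDeleted = @@ROWCOUNT;
END;

Batching keeps each transaction (and its locks and log usage) small, which avoids the long single-statement scan described above.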