SQL Server 2017 – Data Ordered as if Clustered Index

sql serversql-server-2017

I have the following table with 7.5 million records:

CREATE TABLE [dbo].[TestTable](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [TestCol] [nvarchar](50) NOT NULL,
    [TestCol2] [nvarchar](50) NOT NULL,
    [TestCol3] [nvarchar](50) NOT NULL,
    [Anonymised] [tinyint] NOT NULL,
    [Date] [datetime] NOT NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

I notice that when there is a non-clustered index on the date field:

CREATE NONCLUSTERED INDEX IX_TestTable_Date ON [dbo].[TestTable] ([Date])

-and I run the following query:

UPDATE TestTable 
SET TestCol='*GDPR*', TestCol2='*GDPR*', TestCol3='*GDPR*', Anonymised=1
WHERE [Date] <= '25 August 2016'

-the data coming returned by the index access operation is sorted to match the key order of the PK/CX, reducing performance.

Query plan

I was surprised to find that removing the index from the date field actually improves the performance of the query by about 30% because it no longer performs the sort:

Query plan

My theory, and this may be obvious to the more experienced among you, is that it has figured out that the date column is implicitly ordered exactly the same as the primary key / clustered index.

So my question is: Is it possible to take advantage of this fact in order to improve the performance of my query?

Best Answer

I mocked up test data that mostly reproduces your issue:

INSERT INTO [dbo].[TestTable] WITH (TABLOCK)
SELECT TOP (7000000) N'*NOT GDPR*', N'*NOT GDPR*', N'*NOT GDPR*', 0, DATEADD(DAY, q.RN  / 16965, '20160801')
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
ORDER BY q.RN
OPTION (MAXDOP 1);


DROP INDEX IF EXISTS [dbo].[TestTable].IX_TestTable_Date;
CREATE NONCLUSTERED INDEX IX_TestTable_Date ON [dbo].[TestTable] ([Date]);

Statistics for the query that uses the nonclustered index:

Table 'TestTable'. Scan count 1, logical reads 1299838, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 984 ms, elapsed time = 988 ms.

Statistics for the query that uses the clustered index:

Table 'TestTable'. Scan count 1, logical reads 72609, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 781 ms, elapsed time = 772 ms.

Getting to your question:

Is it possible to take advantage of this fact in order to improve the performance of my query?

Yes. You can use the nonclustered index that you already have to efficiently find the maximum id value that needs to be updated. If you save that to a variable and filter against it you get a query plan for the update that does the clustered index scan (without the sort) that stops early and therefore does less IO. Here's one implementation:

DECLARE @Id INT;

SELECT TOP (1) @Id = Id
FROM dbo.TestTable 
WHERE [Date] <= '25 August 2016'
ORDER BY [Date] DESC, Id DESC;

UPDATE TestTable 
SET TestCol='*GDPR*', TestCol2='*GDPR*', TestCol3='*GDPR*', Anonymised=1
WHERE [Id] < @Id AND [Date] <= '25 August 2016'
AND [Anonymised] <> 1 -- optional
OPTION (MAXDOP 1);

Run stats for the new query:

Table 'TestTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'TestTable'. Scan count 1, logical reads 4776, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 515 ms, elapsed time = 510 ms.

As well as the query plan:

ok query plan

With all of that said, your desire to make the query faster suggests to me that you plan to run the query more than once. Right now your query has an open ended filter on the date column. Is it truly necessary to anonymize the rows more than once? Can you avoid updating or scanning rows that were already anonymized? It should certainly be faster to update a range of dates with dates on both sides of it. You could also add the Anonymised column to your index, but that index will needed to be updated during your UPDATE query. In summary, avoid processing the same data over and over if you can.

The original query that you have with the sort is slower due to work done in the Clustered Index Update operator. The amount of time spent on the index seek and the sort is only 407 ms. You can see this in the actual plan. The plan executes in row mode so the time spent on the sort is the time of that operator along with every child operator:

enter image description here

That leaves the sort operator with about 1600 ms of time. SQL Server needs to read pages from the clustered index in order to perform the update. You can see that the Clustered Index Update operator does 1205921 logical reads. You can read more about sorting optimizations for DML and optimized prefetch in this blog post by Paul White.

The other query plan that you have (without the sort) takes 683 ms for the clustered index scan and about 550 ms for the Clustered Index Update operator. The update operator does not do any IO for this query.

The simple answer as to why the plan with the sort is slower is that SQL Server does more logical reads on the clustered index for that plan compared to the clustered index scan plan. Even if all of the needed data is in memory there's still an overhead and cost to doing those logical reads. A better answer is much harder to get, in that as far as I know the plans won't give you any further details. It is possible to use PerfView or another tool based on ETW tracing to compare call stacks between the queries:

enter image description here

On the left is the query that does the clustered index scan and on the right is the query that does the sort. I marked call stacks in blue or red that only appear in one query. Not surprisingly, the different call stacks with a high numbered of sampled CPU cycles for the sort query appear to have to do with the logical reads required to perform the update on the clustered index. In addition, there are differences in number of sampled cycles between the queries for the same operation. For sample, the query with the sort spends 31 cycles acquiring latches whereas the query with the scan only spends 9 cycles acquiring latches.

I suspect that SQL Server is picking the slower plan due to a query plan operator costing limitation. Perhaps part of the difference in run time is due to hardware or your edition of SQL Server. In any case, SQL Server is not able to figure out that the date column is implicitly ordered exactly the same as the clustered index. Data is returned from the clustered index scan in clustered key order, so there is no need to perform a sort in an attempt to optimize IO when doing the clustered index update.