SQL Server – Efficient INSERT INTO a Table With Clustered Index

clustered-index, insert, performance, sql-server

I have a SQL statement that inserts rows into a table with a clustered index on the column TRACKING_NUMBER.

E.g.:

INSERT INTO TABLE_NAME (TRACKING_NUMBER, COL_B, COL_C) 
SELECT TRACKING_NUMBER, COL_B, COL_C 
FROM STAGING_TABLE

My question is: does it help to add an ORDER BY clause on the clustered index column to the SELECT statement, or would any gain be negated by the extra sort the ORDER BY requires?

Best Answer

As the other answers already indicate, SQL Server may or may not explicitly sort the rows into clustered index order prior to the insert.

This depends on whether the Clustered Index Insert operator in the plan has the DMLRequestSort property set (which in turn depends on the estimated number of rows being inserted).

If you find that SQL Server is underestimating this number for whatever reason, you might benefit from adding an explicit ORDER BY to the SELECT query to minimize page splits and the ensuing fragmentation from the INSERT operation.
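Applied to the statement in the question (a sketch; table and column names are as given there):

```sql
INSERT INTO TABLE_NAME (TRACKING_NUMBER, COL_B, COL_C)
SELECT TRACKING_NUMBER, COL_B, COL_C
FROM STAGING_TABLE
ORDER BY TRACKING_NUMBER; -- the clustered index key
```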

Example:

use tempdb;

GO

CREATE TABLE T(N INT PRIMARY KEY,Filler char(2000))

CREATE TABLE T2(N INT PRIMARY KEY,Filler char(2000))

GO

DECLARE @T TABLE (U UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),N int)

INSERT INTO @T(N)
SELECT number 
FROM master..spt_values
WHERE type = 'P' AND number BETWEEN 0 AND 499

/*Estimated row count wrong as inserting from table variable*/
INSERT INTO T(N)
SELECT T1.N*1000 + T2.N
FROM @T T1, @T T2

/*Same operation using explicit sort*/    
INSERT INTO T2(N)
SELECT T1.N*1000 + T2.N
FROM @T T1, @T T2
ORDER BY T1.N*1000 + T2.N


SELECT avg_fragmentation_in_percent,
       fragment_count,
       page_count,
       avg_page_space_used_in_percent,
       record_count
FROM   sys.dm_db_index_physical_stats(2, OBJECT_ID('T'), NULL, NULL, 'DETAILED')
;  


SELECT avg_fragmentation_in_percent,
       fragment_count,
       page_count,
       avg_page_space_used_in_percent,
       record_count
FROM   sys.dm_db_index_physical_stats(2, OBJECT_ID('T2'), NULL, NULL, 'DETAILED')
;  

The output shows that T is massively fragmented:

avg_fragmentation_in_percent fragment_count       page_count           avg_page_space_used_in_percent record_count
---------------------------- -------------------- -------------------- ------------------------------ --------------------
99.3116118225536             92535                92535                67.1668272794663               250000
99.5                         200                  200                  74.2868173956017               92535
0                            1                    1                    32.0978502594514               200

But for T2 fragmentation is minimal:

avg_fragmentation_in_percent fragment_count       page_count           avg_page_space_used_in_percent record_count
---------------------------- -------------------- -------------------- ------------------------------ --------------------
0.376                        262                  62500                99.456387447492                250000
2.1551724137931              232                  232                  43.2438349394613               62500
0                            1                    1                    37.2374598468001               232

Conversely, sometimes you might want to force SQL Server to underestimate the row count, when you know the data is already pre-sorted and wish to avoid an unnecessary sort. One notable example is inserting a large number of rows into a table with a NEWSEQUENTIALID() clustered index key. In versions prior to SQL Server 2012 (code-named "Denali"), SQL Server adds an unnecessary and potentially expensive sort operation. This can be avoided with:

DECLARE @var INT = 2147483647

INSERT INTO Foo
SELECT TOP (@var) *
FROM Bar

SQL Server will then estimate that 100 rows will be inserted, irrespective of the size of Bar, which is below the threshold at which a sort is added to the plan. However, as pointed out in the comments below, this does mean that the insert will unfortunately not be able to take advantage of minimal logging.
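One caveat worth noting (my addition, not part of the original answer): the 100-row guess relies on the value of @var being opaque to the optimizer at compile time. Adding OPTION (RECOMPILE) to that statement can embed the variable's actual value into the plan, restoring the full estimate and potentially the sort, so leave it off here on purpose. Using the same hypothetical Foo and Bar:

```sql
DECLARE @var INT = 2147483647;

-- Variable value not known at compile time: optimizer guesses 100 rows
-- for TOP (@var), keeping the sort out of the plan.
INSERT INTO Foo
SELECT TOP (@var) *
FROM Bar;

-- By contrast, OPTION (RECOMPILE) can embed @var's actual value,
-- restoring the full row estimate (and potentially the sort):
INSERT INTO Foo
SELECT TOP (@var) *
FROM Bar
OPTION (RECOMPILE);
```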