As the other answers already indicate, SQL Server may or may not explicitly ensure that the rows are sorted in clustered index order prior to the insert.
This is dependent upon whether or not the clustered index operator in the plan has the DMLRequestSort property set (which in turn depends upon the estimated number of rows being inserted). If you find that SQL Server is underestimating this for whatever reason, you might benefit from adding an explicit ORDER BY to the SELECT query to minimize page splits and the ensuing fragmentation from the INSERT operation.
Example:
use tempdb;
GO
CREATE TABLE T (N int PRIMARY KEY, Filler char(2000));
CREATE TABLE T2 (N int PRIMARY KEY, Filler char(2000));
GO
DECLARE @T TABLE (U UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),N int)
INSERT INTO @T(N)
SELECT number
FROM master..spt_values
WHERE type = 'P' AND number BETWEEN 0 AND 499
/*Estimated row count wrong as inserting from table variable*/
INSERT INTO T(N)
SELECT T1.N*1000 + T2.N
FROM @T T1, @T T2
/*Same operation using explicit sort*/
INSERT INTO T2(N)
SELECT T1.N*1000 + T2.N
FROM @T T1, @T T2
ORDER BY T1.N*1000 + T2.N
SELECT avg_fragmentation_in_percent,
fragment_count,
page_count,
avg_page_space_used_in_percent,
record_count
FROM sys.dm_db_index_physical_stats(2, OBJECT_ID('T'), NULL, NULL, 'DETAILED')
;
SELECT avg_fragmentation_in_percent,
fragment_count,
page_count,
avg_page_space_used_in_percent,
record_count
FROM sys.dm_db_index_physical_stats(2, OBJECT_ID('T2'), NULL, NULL, 'DETAILED')
;
This shows that T is massively fragmented:
avg_fragmentation_in_percent fragment_count page_count avg_page_space_used_in_percent record_count
---------------------------- -------------------- -------------------- ------------------------------ --------------------
99.3116118225536 92535 92535 67.1668272794663 250000
99.5 200 200 74.2868173956017 92535
0 1 1 32.0978502594514 200
But for T2, fragmentation is minimal:
avg_fragmentation_in_percent fragment_count page_count avg_page_space_used_in_percent record_count
---------------------------- -------------------- -------------------- ------------------------------ --------------------
0.376 262 62500 99.456387447492 250000
2.1551724137931 232 232 43.2438349394613 62500
0 1 1 37.2374598468001 232
Conversely, sometimes you might want to force SQL Server to underestimate the row count, when you know the data is already pre-sorted and wish to avoid an unnecessary sort. One notable example is when inserting a large number of rows into a table with a NEWSEQUENTIALID() clustered index key. In versions of SQL Server prior to Denali (SQL Server 2012), SQL Server adds an unnecessary and potentially expensive sort operation. This can be avoided with a TOP expression driven by a variable:
DECLARE @var INT = 2147483647;

INSERT INTO Foo
SELECT TOP (@var) *
FROM Bar;
SQL Server will then estimate that 100 rows will be inserted, irrespective of the size of Bar, which is below the threshold at which a sort is added to the plan. However, as pointed out in the comments below, this does mean that the insert will unfortunately not be able to take advantage of minimal logging.
The blog post you reference also indicates how you could have answered this yourself.
If you execute
SET SHOWPLAN_TEXT ON;
GO
UPDATE T SET A = 19;
The plan looks like this:
|--Clustered Index Update(OBJECT:([AdventureWorks2008].[dbo].[T].[TPK]), OBJECT:([AdventureWorks2008].[dbo].[T].[TF]), SET:([AdventureWorks2008].[dbo].[T].[A] = [@1]))
|--Compute Scalar(DEFINE:([Expr1009]=[Expr1009]))
|--Compute Scalar(DEFINE:([Expr1009]=CASE WHEN CASE WHEN [AdventureWorks2008].[dbo].[T].[A] = [@1] THEN (1) ELSE (0) END THEN (0) ELSE (1) END))
|--Top(ROWCOUNT est 0)
|--Index Scan(OBJECT:([AdventureWorks2008].[dbo].[T].[TF]), ORDERED FORWARD)
showing a per-row / narrow plan where the index TF
is listed as one of the objects updated.
Best Answer
I've constructed a test-bed to see what happens:
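The test-bed script itself did not survive in this copy. The following is a minimal sketch of what it might have looked like; the table name and PKCol are my assumptions (SomeData is the column named later in the answer):

```sql
-- Sketch only: dbo.SortTest and PKCol are assumed names, not the author's code.
CREATE TABLE dbo.SortTest
(
    PKCol int NOT NULL PRIMARY KEY CLUSTERED,
    SomeData varbinary(max) NOT NULL
);

-- Primary key values deliberately listed in random(ish) order;
-- CRYPT_GEN_RANDOM(1024) pads each row so the table spans many pages.
INSERT INTO dbo.SortTest (PKCol, SomeData)
VALUES (417, CRYPT_GEN_RANDOM(1024)),
       (12,  CRYPT_GEN_RANDOM(1024)),
       (958, CRYPT_GEN_RANDOM(1024)),
       (3,   CRYPT_GEN_RANDOM(1024)),
       (231, CRYPT_GEN_RANDOM(1024));
```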
The first column is the primary key, and as you can see the values are listed in random(ish) order. Listing the values in random order should make SQL Server either sort the rows before inserting them, or insert them in the order given and fragment the index. The CRYPT_GEN_RANDOM() function is used to generate 1,024 bytes of random data per row, to allow this table to consume multiple pages, which in turn allows us to see the effects of fragmented inserts.

Once you run the above insert, you can check fragmentation like this:
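The fragmentation query is missing from this copy; presumably it was along the lines of the sys.dm_db_index_physical_stats examples earlier in the thread (the table name below is my assumption):

```sql
SELECT avg_fragmentation_in_percent,
       fragment_count,
       page_count,
       avg_page_space_used_in_percent,
       record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SortTest'), NULL, NULL, 'DETAILED');
```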
Running this on my SQL Server 2012 Developer Edition instance shows average fragmentation of 90%, indicating SQL Server did not sort during the insert.
The moral of this particular story is likely to be: "when in doubt, sort, if it will be beneficial". Having said that, adding an ORDER BY clause to an insert statement does not guarantee the inserts will occur in that order. Consider what happens if the insert goes parallel, as an example.

On non-production systems you can use trace flag 2332 as an option on the insert statement to "force" SQL Server to sort the input prior to inserting it. @PaulWhite has an interesting article, Optimizing T-SQL queries that change data, covering that and other details. Be aware that the trace flag is unsupported, and should NOT be used in production systems, since that might void your warranty. In a non-production system, for your own education, you can try adding it to the end of the INSERT statement. Once you have that appended to the insert, take a look at the plan, and you'll see an explicit sort.
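The hint itself is not shown in this copy; per-query trace flags are normally applied with OPTION (QUERYTRACEON ...), so the statement would have looked something like this (the table and values are illustrative only):

```sql
INSERT INTO dbo.SortTest (PKCol, SomeData)  -- assumed table/column names
VALUES (417, CRYPT_GEN_RANDOM(1024)),
       (12,  CRYPT_GEN_RANDOM(1024))
OPTION (QUERYTRACEON 2332);  -- undocumented; non-production use only
```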
It would be great if Microsoft would make this a supported trace flag.
Paul White made me aware that SQL Server does automatically introduce a sort operator into the plan when it thinks one will be helpful. For the sample query above, if I run the insert with 250 items in the VALUES clause, no sort is introduced automatically. However, at 251 items, SQL Server automatically sorts the values prior to the insert. Why the cutoff is 250/251 rows remains a mystery to me, other than that it seems to be hard-coded. If I reduce the size of the data inserted in the SomeData column to just one byte, the cutoff is still 250/251 rows, even though the size of the table in both cases is just a single page. Interestingly, looking at the insert with SET STATISTICS IO, TIME ON; shows the inserts with a single-byte SomeData value take twice as long when sorted.

Without the sort (i.e. 250 rows inserted):
With the sort (i.e. 251 rows inserted):
Once you start to increase the row size, the sorted version certainly becomes more efficient. When inserting 4,096 bytes into SomeData, the sorted insert is nearly twice as fast on my test rig as the unsorted insert.

As a side note, in case you're interested, I generated the VALUES (...) clause using some T-SQL that generates 1,000 random values, selecting only the top 50 rows with unique values in the first column. I copied-and-pasted the output into the INSERT statement above.
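The generator script itself is not reproduced here; a sketch of one way it could be written (all names and expressions are my guesses, not the author's code):

```sql
-- Generate 1,000 candidate rows with random first-column values,
-- then keep 50 whose first-column value is unique, emitting VALUES rows.
;WITH RandomRows AS
(
    SELECT TOP (1000)
        N  = ABS(CHECKSUM(NEWID())) % 1000,
        rn = ROW_NUMBER() OVER (ORDER BY NEWID())
    FROM sys.all_columns
),
Deduped AS
(
    SELECT N,
           dup = ROW_NUMBER() OVER (PARTITION BY N ORDER BY rn)
    FROM RandomRows
)
SELECT TOP (50)
       '(' + CONVERT(varchar(11), N) + ', CRYPT_GEN_RANDOM(1024)),'
FROM Deduped
WHERE dup = 1;  -- only rows whose first-column value is unique
```

Copy the output rows into the VALUES clause of the insert, trimming the trailing comma from the last row.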