Sql-server – Is pre-sorting valuable when using table value constructor to insert multiple values into a table with clustered index

clustered-indexinsertperformancesql server

This is related to this question: Efficient INSERT INTO a Table With Clustered Index

From that question it sounds like SQL Server will sometimes pre-sort the data when inserting multiple values into a table with a clustered index to reduce fragmentation. Will SQL Server also do this when I use a Table Value Constructor to insert multiple values in an insert statement?

For example, if I have a table event_log that is clustered on a timestamp column and I insert data like this:

INSERT INTO event_log (timestamp, data)
VALUES
('03:00:02', 'data string...'),
('03:00:01', 'data string...'),
('03:00:00', 'data string...')

Will SQL Server pre-sort for me to ensure the data goes in efficiently without fragmentation or should I sort the data in my app before generating the SQL?

Best Answer

I've constructed a test-bed to see what happens:

USE tempdb;

CREATE TABLE dbo.TestSort
(
    Sorted INT NOT NULL
        CONSTRAINT PK_TestSort
        PRIMARY KEY CLUSTERED
    , SomeData VARCHAR(2048) NOT NULL
);

INSERT INTO dbo.TestSort (Sorted, SomeData)
VALUES  (1797604285, CRYPT_GEN_RANDOM(1024))
    , (1530768597, CRYPT_GEN_RANDOM(1024))
    , (1274169954, CRYPT_GEN_RANDOM(1024))
    , (-1972758125, CRYPT_GEN_RANDOM(1024))
    , (1768931454, CRYPT_GEN_RANDOM(1024))
    , (-1180422587, CRYPT_GEN_RANDOM(1024))
    , (-1373873804, CRYPT_GEN_RANDOM(1024))
    , (293442810, CRYPT_GEN_RANDOM(1024))
    , (-2126229859, CRYPT_GEN_RANDOM(1024))
    , (715871545, CRYPT_GEN_RANDOM(1024))
    , (-1163940131, CRYPT_GEN_RANDOM(1024))
    , (566332020, CRYPT_GEN_RANDOM(1024))
    , (1880249597, CRYPT_GEN_RANDOM(1024))
    , (-1213257849, CRYPT_GEN_RANDOM(1024))
    , (-155893134, CRYPT_GEN_RANDOM(1024))
    , (976883931, CRYPT_GEN_RANDOM(1024))
    , (-1424958821, CRYPT_GEN_RANDOM(1024))
    , (-279093766, CRYPT_GEN_RANDOM(1024))
    , (-903956376, CRYPT_GEN_RANDOM(1024))
    , (181119720, CRYPT_GEN_RANDOM(1024))
    , (-422397654, CRYPT_GEN_RANDOM(1024))
    , (-560438983, CRYPT_GEN_RANDOM(1024))
    , (968519165, CRYPT_GEN_RANDOM(1024))
    , (1820871210, CRYPT_GEN_RANDOM(1024))
    , (-1348787729, CRYPT_GEN_RANDOM(1024))
    , (-1869809700, CRYPT_GEN_RANDOM(1024))
    , (423340320, CRYPT_GEN_RANDOM(1024))
    , (125852107, CRYPT_GEN_RANDOM(1024))
    , (-1690550622, CRYPT_GEN_RANDOM(1024))
    , (570776311, CRYPT_GEN_RANDOM(1024))
    , (2120766755, CRYPT_GEN_RANDOM(1024))
    , (1123596784, CRYPT_GEN_RANDOM(1024))
    , (496886282, CRYPT_GEN_RANDOM(1024))
    , (-571192016, CRYPT_GEN_RANDOM(1024))
    , (1036877128, CRYPT_GEN_RANDOM(1024))
    , (1518056151, CRYPT_GEN_RANDOM(1024))
    , (1617326587, CRYPT_GEN_RANDOM(1024))
    , (410892484, CRYPT_GEN_RANDOM(1024))
    , (1826927956, CRYPT_GEN_RANDOM(1024))
    , (-1898916773, CRYPT_GEN_RANDOM(1024))
    , (245592851, CRYPT_GEN_RANDOM(1024))
    , (1826773413, CRYPT_GEN_RANDOM(1024))
    , (1451000899, CRYPT_GEN_RANDOM(1024))
    , (1234288293, CRYPT_GEN_RANDOM(1024))
    , (1433618321, CRYPT_GEN_RANDOM(1024))
    , (-1584291587, CRYPT_GEN_RANDOM(1024))
    , (-554159323, CRYPT_GEN_RANDOM(1024))
    , (-1478814392, CRYPT_GEN_RANDOM(1024))
    , (1326124163, CRYPT_GEN_RANDOM(1024))
    , (701812459, CRYPT_GEN_RANDOM(1024));

The first column is the primary key, and as you can see the values are listed in random(ish) order. Listing the values in random order should make SQL Server either:

  1. Sort the data, pre-insert
  2. Not sort the data, resulting in a fragmented table.

The CRYPT_GEN_RANDOM() function is used to generate 1024 bytes of random data per row, to allow this table to consume multiple pages, which in turn allows us to see the effects of fragmented inserts.

Once you run the above insert, you can check fragmentation like this:

SELECT * 
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('TestSort'), 1, 0, 'SAMPLED') ips;

Running this on my SQL Server 2012 Developer Edition instance shows average fragmentation of 90%, indicating SQL Server did not sort during the insert.

The moral of this particular story is likely to be, "when in doubt, sort, if it will be beneficial". Having said that, adding and ORDER BY clause to an insert statement does not guarantee the inserts will occur in that order. Consider what happens if the insert goes parallel, as an example.

On non-production systems you can use trace flag 2332 as an option on the insert statement to "force" SQL Server to sort the input prior to inserting it. @PaulWhite has an interesting article, Optimizing T-SQL queries that change data covering that, and other details. Be aware, that trace flag is unsupported, and should NOT be used in production systems, since that might void your warranty. In a non-production system, for your own education, you can try adding this to the end of the INSERT statement:

OPTION (QUERYTRACEON 2332);

Once you have that appended to the insert, take a look at the plan, you'll see an explicit sort:

enter image description here

It would be great if Microsoft would make this a supported trace flag.

Paul White made me aware that SQL Server does automatically introduce a sort operator into the plan when it thinks one will be helpful. For the sample query above, if I run the insert with 250 items in the values clause, no sort is implemented automatically. However, at 251 items, SQL Server automatically sorts the values prior to the insert. Why the cutoff is 250/251 rows remains a mystery to me, other than it seems to be hard-coded. If I reduce the size of the data inserted in the SomeData column to just one byte, the cutoff is still 250/251 rows, even though the size of the table in both cases is just a single page. Interestingly, looking at the insert with SET STATISTICS IO, TIME ON; shows the inserts with a single byte SomeData value take twice as long when sorted.

Without the sort (i.e. 250 rows inserted):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 16 ms, elapsed time = 16 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'TestSort'. Scan count 0, logical reads 501, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.

(250 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 11 ms.

With the sort (i.e. 251 rows inserted):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 17 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'TestSort'. Scan count 0, logical reads 503, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.

(251 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 21 ms.

Once you start to increase the row size, the sorted version certainly becomes more efficient. When inserting 4096 bytes into SomeData, the sorted insert is nearly twice as fast on my test rig as the unsorted insert.


As a side-note, in case you're interested, I generated the VALUES (...) clause using this T-SQL:

;WITH s AS (
    SELECT v.Item
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(Item)
)
, v AS (
    SELECT Num = CONVERT(int, CRYPT_GEN_RANDOM(10), 0)
)
, o AS (
    SELECT v.Num
        , rn = ROW_NUMBER() OVER (PARTITION BY v.Num ORDER BY NEWID())
    FROM s s1
        CROSS JOIN s s2
        CROSS JOIN s s3
        CROSS JOIN v 
)
SELECT TOP(50) ', (' 
        + REPLACE(CONVERT(varchar(11), o.Num), '*', '0') 
        + ', CRYPT_GEN_RANDOM(1024))'
FROM o
WHERE rn = 1
ORDER BY NEWID();

This generates 1,000 random values, selecting only the top 50 rows with unique values in the first column. I copied-and-pasted the output into the INSERT statement above.