SQL Server – Temp Table Clustered Key Not Honored: Bug or Expected?

sql-server temporary-tables

As I was putting some test sets of data together, I noticed some funny behavior with temp tables. When working with large sets of data in clustered temp tables that are populated via a parallel execution plan, the clustered key does not appear to be honored when selecting data. This issue also seems to affect all versions of SQL Server that I've tested (including vNext).

Here's a dbfiddle.uk example of the test. You may have to execute it a couple of times to get the result I am finding, but it shouldn't take more than one or two executions to yield the same results. Additionally, this is the local execution plan from my environment, which shows that the only difference between the large and small data sets is the way data is fed into the tables (i.e. parallel vs. serial plan).

If you want to play along at home, here's the test I'm running:

-- Large Data Set
CREATE TABLE #tmp
(
    ID  INT PRIMARY KEY CLUSTERED
)

INSERT INTO #tmp
-- Purposely insert in reverse order
SELECT TOP 100 PERCENT RN
FROM
(
    SELECT TOP (10000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) x
ORDER BY RN DESC


-- Smaller Data Set
CREATE TABLE #tmp2
(
    ID  INT PRIMARY KEY CLUSTERED
)

INSERT INTO #tmp2
-- Purposely insert in reverse order
SELECT TOP 100 PERCENT RN
FROM
(
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) x
ORDER BY RN DESC

-- Large Record Set
-- Clustered Key Not Honored*
SELECT TOP 10 *
FROM #tmp

-- Small Record Set
-- Clustered Key Honored
SELECT TOP 10 *
FROM #tmp2

DROP TABLE #tmp
DROP TABLE #tmp2

I've not found any references indicating this is expected behavior, but before I submit a Connect item, I first wanted to reach out and confirm this isn't a localized problem. Can someone either point me to documentation stating that this is expected behavior, or alternatively confirm that this is, in fact, a bug?

EDIT: In response to the comments about not including an ORDER BY clause, I was always under the assumption that the TOP keyword returned the data in the order in which it was inserted, which should, in this case, be the order dictated by the clustered key. When running the same statement against a formal (permanent) table, the expected result is returned:

-- Large Data Set with a Formal Data Table
CREATE TABLE tmp
(
    ID  INT PRIMARY KEY CLUSTERED
)

INSERT INTO tmp
-- Purposely insert in reverse order
SELECT TOP 100 PERCENT RN
FROM
(
    SELECT TOP (10000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) x
ORDER BY RN DESC

-- Large Record Set
-- Clustered Key Honored (see output below)
SELECT TOP 10 *
FROM tmp

DROP TABLE tmp

(6325225 row(s) affected)


(1 row(s) affected)
ID
-----------
1
2
3
4
5
6
7
8
9
10

(10 row(s) affected)



(1 row(s) affected)

Even the execution plans are the same, so why the different result sets between a temp table and a formally defined table?

Finally, a shout out to Joe Obbish as I gratuitously ripped off his CROSS JOIN approach to build large sets of test data as it's quite efficient!

Best Answer

There is no guarantee of ORDER without ORDER BY.

The execution plan for both has "Ordered = False".

This means you may get the results in key order but equally may not.

Specifically, see "When can allocation order scans be used?":

The only time such a scan will be used is when there’s no possibility of the data changing (e.g. when the TABLOCK hint is specified, or when the table is in a read-only database) or when it’s explicitly stated that we don’t care (e.g. when the NOLOCK hint is specified or under READ UNCOMMITTED isolation level). As a further twist, there’s a trade-off with setup cost of the allocation order scan against the number of pages that will be read – an allocation order scan will only be used if there’s more than 64 pages to be read.

Because the local temp table is not accessible to other connections, you get this behaviour without explicitly taking a table lock. However, the point about table size still applies, which is why you see the difference between your two cases.
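If the explanation above holds, the permanent table from the question should become eligible for the same kind of scan once a table lock is requested. This is only a sketch: it assumes the question's tmp table is still populated (not yet dropped) and is larger than 64 pages, and whether the rows actually come back out of key order depends on how the pages were allocated during the insert.

```sql
-- Assumes the permanent table tmp from the question still exists and is populated.

-- No hint: other sessions could change the data, so the scan
-- follows the index structure even though the plan shows Ordered = False.
SELECT TOP (10) ID
FROM tmp;

-- TABLOCK: the data cannot change during the scan, so an
-- allocation-order scan becomes eligible. The ten rows returned
-- may or may not be the lowest IDs.
SELECT TOP (10) ID
FROM tmp WITH (TABLOCK);
```

Comparing the two result sets on your own instance is a quick way to see whether the scan type, rather than the temp table itself, explains the difference.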

If you need a specific order, add an ORDER BY to get a scan in key order (with "Ordered = True").
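As a minimal example against the #tmp table from the question, the explicit ORDER BY makes the result deterministic regardless of which scan type the engine would otherwise pick:

```sql
-- With ORDER BY, the ten lowest IDs are guaranteed.
-- The clustered index scan in the plan should now show Ordered = True.
SELECT TOP (10) ID
FROM #tmp
ORDER BY ID;
```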

Related Solutions

SQL Server Sorting – Sort Order in Primary Key Yet Sorting Executed on SELECT

For a non partitioned table I get the following plan

Plan 1

There is a single seek predicate on Seek Keys[1]: Prefix: DeviceId, SensorId = (3819, 53), Start: Date < 1339225010.

This means that SQL Server can perform an equality seek on the first two columns and then begin a range seek starting at 1339225010, ordered FORWARD (as the index is defined with [Date] DESC).

The TOP operator will stop requesting more rows from the seek after the first row is emitted.

When I create the partition scheme and function

CREATE PARTITION FUNCTION PF (int)
AS RANGE LEFT FOR VALUES (1000, 1339225009 ,1339225010 , 1339225011);
GO
CREATE PARTITION SCHEME [MyPartitioningScheme]
AS PARTITION PF
ALL TO ([PRIMARY] );

And populate the table with the following data

INSERT INTO [dbo].[SensorValues]    
/*500 rows matching date and SensorId, DeviceId predicate*/
SELECT TOP (500) 3819,53,1, ROW_NUMBER() OVER (ORDER BY (SELECT 0))           
FROM master..spt_values
UNION ALL
/*700 rows matching date but not SensorId, DeviceId predicate*/
SELECT TOP (700) 3819,52,1, ROW_NUMBER() OVER (ORDER BY (SELECT 0))           
FROM master..spt_values
UNION ALL 
/*1100 rows matching SensorId, DeviceId predicate but not date */
SELECT TOP (1100) 3819,53,1, ROW_NUMBER() OVER (ORDER BY (SELECT 0)) + 1339225011      
FROM master..spt_values

The plan on SQL Server 2008 looks as follows.

Plan 2

The actual number of rows emitted from the seek is 500. The plan shows these seek predicates:

Seek Keys[1]: Start: PtnId1000 <= 2, End: PtnId1000 >= 1, 
Seek Keys[2]: Prefix: DeviceId, SensorId = (3819, 53), Start: Date < 1339225010

This indicates that it is using the skip scan approach described here:

the query optimizer is extended so that a seek or scan operation with one condition can be done on PartitionID (as the logical leading column) and possibly other index key columns, and then a second-level seek, with a different condition, can be done on one or more additional columns, for each distinct value that meets the qualification for the first-level seek operation.

This plan is a serial plan, so for the specific query you have, it seems that if SQL Server ensured that it processed the partitions in descending order of date, the original plan with the TOP would still work. It could stop processing after the first matching row was found, rather than continuing on and outputting the remaining 499 matches.

In fact the plan on 2005 looks like it does take that approach

Plan on 2005

I'm not sure whether it is straightforward to get the same plan on 2008, or whether it would need an OUTER APPLY on sys.partition_range_values to simulate it.
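One way such a simulation might look is sketched below. It rests on several assumptions: the original query (not shown in this answer) is taken to be a TOP (1) ... ORDER BY [Date] DESC with the predicates from the seek keys above, the column names are read off those seek predicates, and the clustered index is assumed to have index_id = 1. It also drives the loop from sys.partitions with a CROSS APPLY rather than the OUTER APPLY on sys.partition_range_values mentioned above. Nothing here guarantees that the optimizer stops after the first partition that yields a row.

```sql
-- Visit partitions from highest to lowest and take the top row from each.
-- Because the table is partitioned on [Date], the first partition that
-- returns a row also holds the overall latest matching row.
SELECT TOP (1) v.DeviceId, v.SensorId, v.[Date]
FROM sys.partitions AS p
CROSS APPLY
(
    SELECT TOP (1) sv.DeviceId, sv.SensorId, sv.[Date]
    FROM dbo.SensorValues AS sv
    WHERE sv.DeviceId = 3819
      AND sv.SensorId = 53
      AND sv.[Date] < 1339225010
      AND $PARTITION.PF(sv.[Date]) = p.partition_number  -- restrict to one partition
    ORDER BY sv.[Date] DESC
) AS v
WHERE p.object_id = OBJECT_ID(N'dbo.SensorValues')
  AND p.index_id = 1              -- assumption: the clustered index
ORDER BY p.partition_number DESC;
```

Check the actual plan and row counts before relying on this; it is meant as a starting point, not a confirmed fix.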

Sql-server – Why are timestamps not always increasing with concurrent inserts

The IDENTITY generator is not well documented. There are some behaviors however that can be observed that seem relevant:

The identity generation does not get affected by transactions. That means once a value has been used it will not be reused, even if the transaction causing its use is rolled back.
Not every use causes an update of the sequence position being written back to the database. You can see that for example after a crash. Often the next used value after a crash is several numbers higher than the previous.
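The first of these behaviors is easy to observe. The script below is a small sketch; the table #identity_demo and its column names are made up for illustration.

```sql
-- Hypothetical demo table: identity column plus a rowversion column.
CREATE TABLE #identity_demo
(
    id      INT IDENTITY(1,1) PRIMARY KEY,
    payload CHAR(1) NOT NULL,
    rv      ROWVERSION
);

INSERT INTO #identity_demo (payload) VALUES ('a');   -- gets id 1

BEGIN TRANSACTION;
INSERT INTO #identity_demo (payload) VALUES ('b');   -- consumes id 2
ROLLBACK TRANSACTION;                                -- row is gone, id 2 is not returned

INSERT INTO #identity_demo (payload) VALUES ('c');   -- gets id 3

-- Returns ids 1 and 3: the rolled-back value leaves a gap.
SELECT id, payload, rv
FROM #identity_demo
ORDER BY rv;

DROP TABLE #identity_demo;
```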

While there is no proof (meaning documentation), it can be assumed that for performance reasons a multi-row insert grabs a block of identity values and uses them until it runs out. Another concurrent thread will get the next block of numbers. At this point the identity value does not actually reflect the order of inserts anymore.

The rowversion data type on the other hand is an ever increasing number that would reflect insert order. (timestamp is a deprecated synonym for rowversion.)

So in your case you can assume that the rows were inserted in the order of the rowversion column and that the out-of-order identity value is caused by in memory performance optimizations.

By the way, while the IDENTITY generator is not very well documented, the SEQUENCE functionality introduced in SQL Server 2012 is. Its documentation covers behaviors like the ones described above as they apply to sequences.
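For comparison, here is a short sketch of the documented SEQUENCE features that correspond to the caching and block behavior discussed above. dbo.DemoSeq is a hypothetical name, and the example requires SQL Server 2012 or later. It shows what sequences do explicitly; it does not prove that IDENTITY works the same way internally.

```sql
-- dbo.DemoSeq is a hypothetical sequence used only for this sketch.
CREATE SEQUENCE dbo.DemoSeq
    AS INT
    START WITH 1
    INCREMENT BY 1
    CACHE 50;          -- values are served from an in-memory cache of 50

-- Fetch a single value, comparable to one IDENTITY assignment.
SELECT NEXT VALUE FOR dbo.DemoSeq AS single_value;

-- Reserve a contiguous block of 10 values in one call.
DECLARE @first SQL_VARIANT;
EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.DemoSeq',
     @range_size        = 10,
     @range_first_value = @first OUTPUT;

SELECT @first AS first_value_in_block;

DROP SEQUENCE dbo.DemoSeq;
```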

As for your concern with replication:

Transactional replication is using the database log and does not rely on specific column values.
Merge replication uses a rowguid column to identify a row. This is a column that gets valued once and does not change throughout the life of the row. Merge replication does not use a rowversion column. Transactional consistency is enforced by the fact that at the time of a synchronization, normal locking is used, so a transaction is either completely visible to the merge agent or completely invisible.
Snapshot replication does not look for changes at all. It just takes the data that is committed at the time of the synchronization and copies it over.
