I haven't managed to reproduce this after running your code a few times.
I presume that it must happen when a later row gets inserted onto an earlier page in the file though.
So the order of operations is (for example):
- Rows inserted into heap on pages 200, 207, 223
- The SELECT statement starts and performs an allocation-ordered scan. It finds that the first page is 200 and blocks waiting for a row lock to be released.
- Other rows are inserted by the first transaction. Some of them are allocated on a page before 200. Insert transaction commits.
- The row lock is released and the SELECT continues its allocation-ordered scan. Rows earlier in the file are missed.
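The sequence above can be sketched as a toy simulation (plain Python, not SQL Server itself; the page numbers and row names are the hypothetical ones from the list). The point is that a scan walking pages in file order never revisits pages it has already passed, so rows committed onto an earlier page while the scan is blocked are lost:

```python
# Toy model of an allocation-ordered scan: pages are visited in file order,
# so rows that become visible on pages *before* the scan's current position
# are never read.

def allocation_ordered_scan(pages_in_file, start_rows, late_rows, blocked_on):
    """pages_in_file: page ids in the file; start_rows/late_rows: page -> rows."""
    seen = []
    table = dict(start_rows)            # visible state when the scan starts
    for page in sorted(pages_in_file):
        if page == blocked_on:
            table.update(late_rows)     # concurrent insert commits while we wait
        seen.extend(table.get(page, []))
    return seen

# Rows initially on pages 200, 207, 223; the scan blocks on page 200, and
# meanwhile the inserting transaction commits a row onto page 150.
rows = allocation_ordered_scan(
    pages_in_file=[150, 200, 207, 223],
    start_rows={200: ["r1"], 207: ["r2"], 223: ["r3"]},
    late_rows={150: ["r0"]},
    blocked_on=200,
)
print(rows)  # ['r1', 'r2', 'r3'] -- "r0" on page 150 is missed
```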
The table comprised 10 pages. By default the first 8 pages will be allocated from mixed extents and then it will be allocated a uniform extent. Maybe in your case space was available in the file for a free uniform extent prior to the mixed extents that were used.
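A quick sketch of the allocation pattern described above (the pre-SQL Server 2016 default behaviour: single pages from mixed extents up to 8 pages, whole 8-page uniform extents thereafter). The function name is my own, for illustration:

```python
# Sketch of the default allocation pattern: the first 8 pages of a table come
# one at a time from mixed extents; further space is allocated as whole
# 8-page uniform extents.

def allocation_units(page_count, mixed_threshold=8, extent_pages=8):
    mixed = min(page_count, mixed_threshold)
    remaining = page_count - mixed
    uniform_extents = -(-remaining // extent_pages)  # ceiling division
    return mixed, uniform_extents

print(allocation_units(10))  # (8, 1): 8 mixed pages, then 1 uniform extent
```

So a 10-page table has pages scattered across mixed extents plus one uniform extent, and that uniform extent can sit anywhere free space was available in the file, including before the mixed extents.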
You can test this theory by running the following in a different window after you have reproduced the issue, and checking whether the missing rows from the original SELECT all appear at the beginning of this resultset.
SELECT [SomeData],
       Moment,
       SomeInt,
       file_id,
       page_id,
       slot_id
FROM   [SomeTable]
       /*Undocumented - use at your own risk*/
       CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
ORDER  BY page_id, SomeInt
The operation against an indexed table will be in index key order rather than allocation order so will not be affected by this particular scenario.
An allocation-ordered scan can also be carried out against an index, but it is only considered if the table is sufficiently large and either the isolation level is read uncommitted or a table lock is held.
Because read committed generally releases locks as soon as the data is read, it is possible for a scan against the index to read rows twice or not at all (if the index key is updated by a concurrent transaction, causing the row to move forwards or backwards). See The Read Committed Isolation Level for more discussion of this type of issue.
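The double-read/skip behaviour can be illustrated with another toy simulation (plain Python, hypothetical keys): an index-order scan tracks the last key it read, so a concurrent key update that moves a row ahead of that point causes a second read, and one that moves it behind causes the row to be missed.

```python
# Toy model of an index-order scan under read committed: the scan remembers
# only the last key read. A concurrent update that changes a row's index key
# moves the row relative to the scan position.

def index_scan_with_move(keys, move_from, move_to, move_when_at):
    seen = []
    keys = sorted(keys)
    pos = min(keys) - 1                 # last key read so far
    while True:
        remaining = [k for k in keys if k > pos]
        if not remaining:
            break
        k = min(remaining)
        seen.append(k)
        pos = k
        if k == move_when_at and move_from in keys:
            keys.remove(move_from)      # concurrent transaction updates the key,
            keys.append(move_to)        # moving the row within the index
    return seen

# Row with key 10 is moved to key 50 after the scan has read key 20:
print(index_scan_with_move([10, 20, 30], 10, 50, 20))  # [10, 20, 30, 50] -- read twice
# Row with key 50 is moved to key 5 after the scan has read key 20:
print(index_scan_with_move([20, 30, 50], 50, 5, 20))   # [20, 30] -- never read
```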
By the way, I was originally envisaging for the indexed case that the index was on one of the columns that increases relative to insert order (any of Id, Moment, SomeInt). However, even if the clustered index is on the random SomeData column, the issue still doesn't arise.
I tried
DBCC TRACEON(3604, 1200, -1) /*Caution. Global trace flag. Outputs lock info
on every connection*/
SELECT TOP 2 *,
%%LOCKRES%%
FROM [SomeTable] WITH (NOLOCK)
ORDER BY [SomeData];
SELECT *,
%%LOCKRES%%
FROM [SomeTable]
ORDER BY [SomeData];
/*Turn off the trace flags with TRACEOFF. This doesn't check whether
they were already on before we started*/
DBCC TRACEOFF(3604, 1200, -1)
Results were as below
The second resultset includes all 1,000 rows. The locking info shows that even though the scan was blocked waiting on lock resource 24c910701749,
when the lock was released it did not just continue from that point. Instead it immediately released that lock and acquired a row lock on the new first row.
It looks like SQL Server is generating a parameterized query plan that can work for any value of @CustomerPartitionKey. In order to do so, it seems to treat @CustomerPartitionKey as both a partition and a column you are seeking upon. If we take a look at the query plan where we have the bad estimate (3000 rows estimated, 300000 actual), we see that there are actually two separate seek predicates on part.Contact related to @CustomerPartitionKey:
Seek Keys[1]: Prefix: PtnId1004 = Scalar Operator([Expr1008]), [Test].[part].[Contact].CustomerPartitionKey = Scalar Operator([@CustomerPartitionKey])
I think that the latter ([Test].[part].[Contact].CustomerPartitionKey = Scalar Operator([@CustomerPartitionKey])) is able to get a proper estimate based on parameter sniffing for the value of @CustomerPartitionKey. However, the former (Prefix: PtnId1004 = Scalar Operator([Expr1008])) is likely not able to do so, perhaps because Expr1008 is a complicated expression that performs partition elimination: [Expr1008] = RangePartitionNew([@CustomerPartitionKey],(0),(0),(1),(2),...,(97),(98)).
In this case, there are 100 partitions and the row estimate is exactly 100 times too low, because SQL Server isn't able to process the partition elimination in the same smart way that it processes the actual seek on the column using the sniffed runtime parameter value of 3. This theory is supported by the way that the estimate varies if you change the number of partitions; with 90 partitions instead, the estimate becomes 3333.33 (300000 / 90).
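The arithmetic behind this theory is simple to check: if the optimizer spreads the actual row count evenly across partitions, the estimate is off by exactly the partition count.

```python
# Verifying the estimate-vs-partition-count relationship described above:
# estimate = actual_rows / number_of_partitions.
actual_rows = 300_000
for partitions in (100, 90):
    estimate = actual_rows / partitions
    print(partitions, round(estimate, 2))  # 100 -> 3000.0, 90 -> 3333.33
```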
In our own queries, we typically use a literal (e.g., 3 in this case) or add OPTION (RECOMPILE) whenever we are writing a query that is going to take advantage of partition elimination. This practice has worked fairly well for us, given that the number of queries on the system is modest and query compilation overhead for queries against large partitioned tables is not a concern for us. Not necessarily a satisfying answer, but it might work for you.
Best Answer
You're talking about a CASE expression. My example of a CASE expression is Standard SQL and not specific to the T-SQL syntax. Check here if you'd like to learn more about the implementation of the CASE expression.