Sql-server – Why does NOLOCK make a scan with variable assignment slower

sql serversql-server-2016

I'm fighting against NOLOCK in my current environment. One argument I've heard is that the overhead of locking slows down a query. So, I devised a test to see just how much this overhead might be.

I discovered that NOLOCK actually slows down my scan.

At first I was delighted, but now I 'm just confused. Is my test invalid somehow? Shouldn't NOLOCK actually allow a slightly faster scan? What's happening here?

Here's my script:

USE TestDB
GO

--Create a five-million row table
DROP TABLE IF EXISTS dbo.JustAnotherTable
GO

CREATE TABLE dbo.JustAnotherTable (
ID INT IDENTITY PRIMARY KEY,
notID CHAR(5) NOT NULL )

INSERT dbo.JustAnotherTable
SELECT TOP 5000000 'datas'
FROM sys.all_objects a1
CROSS JOIN sys.all_objects a2
CROSS JOIN sys.all_objects a3

/********************************************/
-----Testing. Run each multiple times--------
/********************************************/
--How fast is a plain select? (I get about 587ms)
DECLARE @trash CHAR(5), @dt DATETIME = SYSDATETIME()

SELECT @trash = notID  --trash variable prevents any slowdown from returning data to SSMS
FROM dbo.JustAnotherTable
ORDER BY ID
OPTION (MAXDOP 1)

SELECT DATEDIFF(MILLISECOND,@dt,SYSDATETIME())

----------------------------------------------
--Now how fast is it with NOLOCK? About 640ms for me
DECLARE @trash CHAR(5), @dt DATETIME = SYSDATETIME()

SELECT @trash = notID
FROM dbo.JustAnotherTable (NOLOCK)
ORDER BY ID --would be an allocation order scan without this, breaking the comparison
OPTION (MAXDOP 1)

SELECT DATEDIFF(MILLISECOND,@dt,SYSDATETIME())

What I've tried that didn't work:

Running on different servers (same results, servers were 2016-SP1 and 2016-SP2, both quiet)
Running on dbfiddle.uk on different versions (noisy, but probably
same results)
SET ISOLATION LEVEL instead of hints (same results)
Turning off lock escalation on the table (same results)
Examining actual execution time of the scan in the actual query plan
(same results)
Recompile hint (same results)
Read only filegroup (same results)

The most promising exploration comes from removing the trash variable and using a no-results query. Initially this showed NOLOCK as slightly faster, but when I showed the demo to my boss, NOLOCK was back to being slower.

What is it about NOLOCK that slows down a scan with variable assignment?

Best Answer

NOTE: this might not be the type of answer you're looking for. But perhaps it will be helpful to other potential answerers as far as providing clues as where to start looking

When I run these queries under ETW tracing (using PerfView), I get the following results:

Plain  - 608 ms  
NOLOCK - 659 ms

So the difference is 51ms. This is pretty dead on with your difference (~50ms). My numbers are slightly higher overall because of the profiler sampling overhead.

Finding the difference

Here's a side-by-side comparison showing that the 51ms difference is in the FetchNextRow method in sqlmin.dll:

Plain select is on the left at 332 ms, while the nolock version is on the right at 383 (51ms longer). You can also see that the two code paths differ in this way:

Plain SELECT
- sqlmin!RowsetNewSS::FetchNextRow calls
  - sqlmin!IndexDataSetSession::GetNextRowValuesInternal
Using NOLOCK
- sqlmin!RowsetNewSS::FetchNextRow calls
  - sqlmin!DatasetSession::GetNextRowValuesNoLock which calls either
    - sqlmin!IndexDataSetSession::GetNextRowValuesInternal or
    - kernel32!TlsGetValue

This shows that there is some branching in the FetchNextRow method based on the isolation level / nolock hint.

Why does the `NOLOCK` branch take longer?

The nolock branch actually spends less time calling into GetNextRowValuesInternal (25ms less). But the code directly in GetNextRowValuesNoLock (not including methods it calls AKA the "Exc" column) runs for 63ms - which is the majority of the difference (63 - 25 = 38ms net increase in CPU time).

So what's the other 13ms (51ms total - 38ms accounted for so far) of overhead in FetchNextRow?

Interface dispatch

I thought this was more of a curiosity than anything, but the nolock version appears to incur some interface dispatch overhead by calling into the Windows API method kernel32!TlsGetValue via kernel32!TlsGetValueStub - a total of 17ms. The plain select appears to not go through the interface, so it never hits the stub, and only spends 6ms on TlsGetValue (a difference of 11ms). You can see this above in the first screenshot.

_{I should probably run this trace again with more iterations of the query, I think there are some small things, like hardware interrupts, that were not picked up by the 1ms sample rate of PerfView}

Outside of that method, I noticed another small difference that causes the nolock version to run slower:

Releasing Locks

The nolock branch appears to more aggressively run the sqlmin!RowsetNewSS::ReleaseRows method, which you can see in this screenshot:

The plain select is on top, at 12ms, while the nolock version is on the bottom at 26ms (14ms longer). You can also see in the "When" column that the code was executed more frequently during the sample. This may be an implementation detail of nolock, but it seems to introduce quite a bit of overhead for small samples.

There are lots of other small differences, but those are the big chunks.

Best Answer

Finding the difference

Why does the NOLOCK branch take longer?

Interface dispatch

Releasing Locks

Related Solutions

Sql-server – SQL Server 2008 datetime index performance bug

Sql-server – NEWID() In Joined Virtual Table Causes Unintended Cross Apply Behavior

Related Question

Why does the `NOLOCK` branch take longer?