Sql-server – How is SQL Server returning both a new value and old value during an UPDATE

concurrencylockingsql-server-2008-r2

We've had issues, during high concurrency, of queries returning non-sensical results – results the violate the logic of the queries being issued. It took a while to reproduce the issue. I've managed to distil the reproducible problem down to a few handfuls of T-SQL.

Note: The part of the live system having the issue is composed of 5 tables, 4 triggers, 2 stored procedures, and 2 views. I've simplified down the real system into something much more manageable for a posted question. Things have been pared down, columns removed, stored procedures made inline, views turned into common table expressions, values of columns changed. This is all a long way of saying that while what follows reproduces an error, it may be more difficult to understand. You'll have to refrain from wondering why something is structured the way it is. I'm here trying to figure out why the error condition reproducibly happens in this toy model.

/*
The idea in this system is that people are able to take days off. 
We create a table to hold these *"allocations"*, 
and declare sample data that only **1** production operator 
is allowed to take time off:
*/
IF OBJECT_ID('Allocations') IS NOT NULL DROP TABLE Allocations
CREATE TABLE [dbo].[Allocations](
    JobName varchar(50) PRIMARY KEY NOT NULL,
    Available int NOT NULL
)
--Sample allocation; there is 1 avaialable slot for this job
INSERT INTO Allocations(JobName, Available)
VALUES ('Production Operator', 1);

/*
Then we open up the system to the world, and everyone puts in for time. 
We store these requests for time off as *"transactions"*. 
Two production operators requested time off. 
We create sample data, and note that one of the users 
created their transaction first (by earlier CreatedDate):
*/
IF OBJECT_ID('Transactions') IS NOT NULL DROP TABLE Transactions;
CREATE TABLE [dbo].[Transactions](
    TransactionID int NOT NULL PRIMARY KEY CLUSTERED,
    JobName varchar(50) NOT NULL,
    ApprovalStatus varchar(50) NOT NULL,
    CreatedDate datetime NOT NULL
)
--Two sample transactions
INSERT INTO Transactions (TransactionID, JobName, ApprovalStatus, CreatedDate)
VALUES (52625, 'Production Operator', 'Booked', '20140125 12:00:40.820');
INSERT INTO Transactions (TransactionID, JobName, ApprovalStatus, CreatedDate)
VALUES (60981, 'Production Operator', 'WaitingList', '20150125 12:19:44.717');

/*
The allocation, and two sample transactions are now in the database:
*/
--Show the sample data
SELECT * FROM Allocations
SELECT * FROM Transactions

The transactions are both inserted as WaitingList. Next we have a periodic task that runs, looking for empty slots and bumps anyone on the WaitingList into a Booked status.

In a separate SSMS window, we have the simulated recurring stored procedure:

/*
    Simulate recurring task that looks for empty slots, 
    and bumps someone on the waiting list into that slot.
*/
SET NOCOUNT ON;

--Reset the faulty row so we can continue testing
UPDATE Transactions SET ApprovalStatus = 'WaitingList'
WHERE TransactionID = 60981

--DBCC TRACEON(3604,1200,3916,-1) WITH NO_INFOMSGS

DECLARE @attempts int
SET @attempts = 0;

WHILE (@attempts < 1000000)
BEGIN
    SET @attempts = @attempts+1;

    /*
        The concept is that if someone is already "Booked", then they occupy an available slot.
        We compare the configured amount of allocations (e.g. 1) to how many slots are used.
        If there are any slots leftover, then find the **earliest** created transaction that 
        is currently on the WaitingList, and set them to Booked.
    */

    PRINT '=== Looking for someone to bump ==='
    WITH AvailableAllocations AS (
        SELECT 
            a.JobName,
            a.Available AS Allocations, 
            ISNULL(Booked.BookedCount, 0) AS BookedCount, 
            a.Available-ISNULL(Booked.BookedCount, 0) AS Available
        FROM Allocations a
            FULL OUTER JOIN (
                SELECT t.JobName, COUNT(*) AS BookedCount
                FROM Transactions t
                WHERE t.ApprovalStatus IN ('Booked') 
                GROUP BY t.JobName
            ) Booked
            ON a.JobName = Booked.JobName
        WHERE a.Available > 0
    )
    UPDATE Transactions SET ApprovalStatus = 'Booked'
    WHERE TransactionID = (
        SELECT TOP 1 t.TransactionID
        FROM AvailableAllocations aa
            INNER JOIN Transactions t
            ON aa.JobName = t.JobName
            AND t.ApprovalStatus = 'WaitingList'
        WHERE aa.Available > 0
        ORDER BY t.CreatedDate 
    )


    IF EXISTS(SELECT * FROM Transactions WHERE TransactionID = 60981 AND ApprovalStatus = 'Booked')
    begin
        --DBCC TRACEOFF(3604,1200,3916,-1) WITH NO_INFOMSGS
        RAISERROR('The later tranasction, that should never be booked, managed to get booked!', 16, 1)
        BREAK;
    END
END

And finally run this in a 3rd SSMS connection window. This simulates a concurrency issue where the earlier transaction goes from taking up a slot, to being on the waiting list:

/*
    Toggle the earlier transaction back to "WaitingList".
    This means there are two possibilies:
       a) the transaction is "Booked", meaning no slots are available. 
          Therefore nobody should get bumped into "Booked"
       b) the transaction is "WaitingList", 
          meaning 1 slot is open and both tranasctions are "WaitingList"
          The earliest transaction should then get "Booked" into the slot.

    There is no time when there is an open slot where the 
    first transaction shouldn't be the one to get it - he got there first.
*/
SET NOCOUNT ON;

--Reset the faulty row so we can continue testing
UPDATE Transactions SET ApprovalStatus = 'WaitingList'
WHERE TransactionID = 60981

DECLARE @attempts int
SET @attempts = 0;

WHILE (@attempts < 100000)
BEGIN
    SET @attempts = @attempts+1

    /*Flip the earlier transaction from Booked back to WaitingList
        Because it's now on the waiting list -> there is a free slot.
        Because there is a free slot -> a transaction can be booked.
        Because this is the earlier transaction -> it should always be chosen to be booked
    */
    --DBCC TRACEON(3604,1200,3916,-1) WITH NO_INFOMSGS

    PRINT '=== Putting the earlier created transaction on the waiting list ==='

    UPDATE Transactions
    SET ApprovalStatus = 'WaitingList'
    WHERE TransactionID = 52625

    --DBCC TRACEOFF(3604,1200,3916,-1) WITH NO_INFOMSGS

    IF EXISTS(SELECT * FROM Transactions WHERE TransactionID = 60981 AND ApprovalStatus = 'Booked')
    begin
        RAISERROR('The later tranasction, that should never be booked, managed to get booked!', 16, 1)
        BREAK;
    END
END

Conceptually, the bumping procedure keeps looking for any empty slots. If it finds one, it takes the earliest transaction that is on the WaitingList and marks it as Booked.

When tested without concurrency, the logic works. We have two transactions:

12:00 pm: WaitingList
12:20 pm: WaitingList

There is 1 allocation, and 0 booked transactions, so we mark the earlier transaction as booked:

12:00 pm: Booked
12:20 pm: WaitingList

The next time the task runs, there is now 1 slot being taken up – so there is nothing to update.

If we then update the first transaction, and put it onto the WaitingList:

UPDATE Transactions SET ApprovalStatus='WaitingList'
WHERE TransactionID = 60981

Then we're back where we started:

12:00 pm: WaitingList
12:20 pm: WaitingList

Note: You may be wondering why i'm putting a transaction back on the waiting list. That's a casualty of the simplified toy model. In the real system transactions can be PendingApproval, which also occupies a slot. A PendingApproval transaction is put onto the waiting list when it is approved. Doesn't matter. Don't worry about it.

But when i introduce concurrency, by having a second window constantly putting the first transaction back onto the waiting list after being booked, then the later transaction managed to get the booking:

12:00 pm: WaitingList
12:20 pm: Booked

The toy test scripts catch this, and stop iterating:

Msg 50000, Level 16, State 1, Line 41
The later tranasction, that should never be booked, managed to get booked!

Why?

The question is, why in this toy model, is this bail-out condition being triggered?

There are two possible states for the first transaction's approval status:

Booked: in which case the slot is occupied, and the later transaction cannot have it
WaitingList: in which case there is one empty slot, and two transactions who want it. But since we always select the oldest transaction (i.e. ORDER BY CreatedDate) the first transaction should get it.

I thought maybe because of other indexes

I learned that after an UPDATE has started, and data has been modified, it is possible to read the old values. In the initial conditions:

Clustered index: Booked
Non-clustered index: Booked

Then i do an update, and while the clustered index leaf node has been modified, any non-clustered indexes still contain the original value and are still available for reading:

Clustered index (Exclusive lock): ~~Booked~~ WaitingList
Non-clustered index: (unlocked) Booked

But that doesn't explain the observed problem. Yes the transaction is no longer Booked, meaning there is now an empty slot. But that change is not committed yet, it is still exclusively held. If the bumping procedure ran, it would either:

block: if snapshot isolation database option is off
read the old value (e.g. Booked): if snapshot isolation is on

Either way, the bumping job would not know there is an empty slot.

So i have no idea

We've been struggling for days for figure out how these nonsensical results could happen.

You may not understand the original system, but there's a set of toy reproducible scripts. They bail out when the invalid case is detected. Why is it being detected? Why is it happening?

Bonus Question

How does NASDAQ solve this? How does cavirtex? How does mtgox?

tl;dr

There's three script blocks. Put them into 3 separate SSMS tabs and run them. The 2nd and 3rd scripts will raise an error. Help me figure out why they error appears.

Best Answer

The default READ COMMITTED transaction isolation level guarantees that your transaction will not read uncommitted data. It does not guarantee that any data you read will remain the same if you read it again (repeatable reads) or that new data will not appear (phantoms).

These same considerations apply to multiple data accesses within the same statement.

Your UPDATE statement produces a plan that accesses the Transactions table more than once, so it is susceptible to effects caused by non-repeatable reads and phantoms.

Multiple access

There are multiple ways for this plan to produce results you do not expect under READ COMMITTED isolation.

An example

The first Transactions table access finds rows that have a status of WaitingList. The second access counts the number of entries (for the same job) that have a status of Booked. The first access may return only the later transaction (the earlier one is Booked at this point). When the second (counting) access occurs, the earlier transaction has been changed to WaitingList. The later row therefore qualifies for the update to Booked status.

Solutions

There are several ways to set the isolation semantics to get the results you are after. One option is to enable READ_COMMITTED_SNAPSHOT for the database. This provides statement-level read consistency for statements running at the default isolation level. Non-repeatable reads and phantoms are not possible under read committed snapshot isolation.

Other remarks

I have to say though that I would not have designed the schema or query this way. There is rather more work involved than should be necessary to meet the stated business requirement. Perhaps this is partly the result of the simplifications in the question, in any case that is a separate question.

The behaviour you are seeing does not represent a bug of any kind. The scripts produce correct results given the requested isolation semantics. Concurrency effects like this are also not limited to plans which access data multiple times.

The read committed isolation level provides many fewer guarantees than are commonly assumed. For example, skipping rows and/or reading the same row more than once is perfectly possible.

Why?

I thought maybe because of other indexes

So i have no idea

Bonus Question

tl;dr

Best Answer

An example

Solutions

Other remarks

Related Solutions

Sql-server – Does SQL Server check disk space before allocating a new page

Sql-server – the best design for dynamic assignment of locations for variable sized grid boxes to occupy

Related Question