Sql-server – Why do updates to separate sets of rows deadlock

deadlocklockingsql serversql-server-2012

_{Update: I have resolved my problem by switching to optimistic locking (SNAPSHOT isolation). The question of why that was necessary in the first place remains unanswered, but I had to move on and thus consider the question purely academic at this point.}

I am experiencing deadlocks between separate instances of a stored procedure that I do not understand.

My system collects data from manufacturing machines in three plants. Various dashboards select data from this database, but updates/inserts are primarily done through a single stored procedure: saveCounts. There is a script for each plant that calls this procedure once per minute, passing in the new data for the machines in that plant. The procedure is ~700 lines and consists of a series of queries, so I'm not going to include the whole thing here. The parts of it that are producing deadlocks are several updates to the table countLogLast, as shown in the deadlock report below.

The tables involved look like this:

countLogLast
- tiny table (<200 rows) that just stores the latest information about each machine
- primary key is machineID INT
countLog
- table with millions of rows of historical data
- primary key is (machineID INT, countStamp DATETIME)
@shortStopDowntime and @falseDowntime
- tiny table variables (<10 rows) storing a subset of the machines from the relevant plant
- primary key is machineID INT

Here are the deadlock graph and XML report:

<deadlock>
 <victim-list>
  <victimProcess id="process8200f3868" />
 </victim-list>
 <process-list>
  <process id="process8200f3868" taskpriority="0" logused="6152" waitresource="KEY: 7:72057594089504768 (8194443284a0)" waittime="205" ownerId="95950131" transactionname="user_transaction" lasttranstarted="2020-02-24T06:35:03.737" XDES="0x8dca5740" lockMode="U" schedulerid="2" kpid="10416" status="suspended" spid="60" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-02-24T06:35:03.733" lastbatchcompleted="2020-02-24T06:35:03.733" lastattention="1900-01-01T00:00:00.733" hostname="ELGSQL01" hostpid="6376" loginname="SPC" isolationlevel="read uncommitted (1)" xactid="95950131" currentdb="7" lockTimeout="4294967295" clientoption1="673316896" clientoption2="128056">
   <executionStack>
    <frame procname="Andon.dbo.saveCounts" line="428" stmtstart="37104" stmtend="37906" sqlhandle="0x03000700c215e36c43eba20068ab000001000000000000000000000000000000000000000000000000000000">
        UPDATE old
        SET
            lastDownStamp = c.countStamp
        FROM
            countLogLast AS old
            JOIN countLog AS c ON c.machineID = old.machineID
        WHERE
            old.machineID IN (SELECT machineID FROM @shortStopDowntime) AND
            c.countStamp = (
                SELECT MAX(c2.countStamp)
                FROM countLog AS c2
                WHERE
                    c2.machineID = c.machineID AND
                    c2.reasonID != @REASON_RUNNING
            )
    </frame>
    <frame procname="adhoc" line="31" stmtstart="5116" sqlhandle="0x020000003f92ac38a5f767af9f699bc89e1c4b0c19c05dff0000000000000000000000000000000000000000">
EXEC dbo.saveCounts @countTable, @P109    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@P1 varchar(2),@P2 varchar(1),@P3 varchar(1),@P4 varchar(1),@P5 varchar(2),@P6 varchar(1),@P7 varchar(1),@P8 varchar(1),@P9 varchar(2),@P10 varchar(1),@P11 varchar(1),@P12 varchar(1),@P13 varchar(2),@P14 varchar(1),@P15 varchar(1),@P16 varchar(1),@P17 varchar(2),@P18 varchar(2),@P19 varchar(2),@P20 varchar(1),@P21 varchar(2),@P22 varchar(1),@P23 varchar(1),@P24 varchar(1),@P25 varchar(2),@P26 varchar(3),@P27 varchar(1),@P28 varchar(1),@P29 varchar(2),@P30 varchar(3),@P31 varchar(2),@P32 varchar(1),@P33 varchar(2),@P34 varchar(1),@P35 varchar(1),@P36 varchar(1),@P37 varchar(2),@P38 varchar(3),@P39 varchar(1),@P40 varchar(1),@P41 varchar(2),@P42 varchar(1),@P43 varchar(1),@P44 varchar(1),@P45 varchar(2),@P46 varchar(1),@P47 varchar(1),@P48 varchar(1),@P49 varchar(2),@P50 varchar(1),@P51 varchar(1),@P52 varchar(1),@P53 varchar(2),@P54 varchar(1),@P55 varchar(1),@P56 varchar(1),@P57 varchar(2),@P58 varchar(1),@P59 varchar(1),@P60 varchar(1),@P61 varchar(2),@P62 varchar(1),@P63 varchar(1),@P64 varchar(1),@P65 va   </inputbuf>
  </process>
  <process id="process276716558" taskpriority="0" logused="12640" waitresource="KEY: 7:72057594089504768 (879d521ac67a)" waittime="205" ownerId="95950768" transactionname="user_transaction" lasttranstarted="2020-02-24T06:35:10.063" XDES="0x274147740" lockMode="U" schedulerid="4" kpid="9984" status="suspended" spid="56" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-02-24T06:35:09.950" lastbatchcompleted="2020-02-24T06:35:09.950" lastattention="1900-01-01T00:00:00.950" hostname="ELGSQL01" hostpid="3060" loginname="SPC" isolationlevel="read uncommitted (1)" xactid="95950768" currentdb="7" lockTimeout="4294967295" clientoption1="673316896" clientoption2="128056">
   <executionStack>
    <frame procname="Andon.dbo.saveCounts" line="353" stmtstart="29600" stmtend="34580" sqlhandle="0x03000700c215e36c43eba20068ab000001000000000000000000000000000000000000000000000000000000">
        UPDATE old
        SET
            lastDownReasonID = c.reasonID,
            lastDownStamp = c.countStamp
        FROM
            countLogLast AS old
            JOIN countLog AS c ON c.machineID = old.machineID
        WHERE
            old.machineID IN (SELECT machineID FROM @falseDowntime) AND
            c.countStamp = (
                SELECT MAX(c2.countStamp)
                FROM countLog AS c2
                WHERE
                    c2.machineID = c.machineID AND
                    c2.reasonID != @REASON_RUNNING
            )
    <frame procname="adhoc" line="63" stmtstart="11452" sqlhandle="0x02000000a477fc2448c590d65bbade781893c643435fa3390000000000000000000000000000000000000000">
EXEC dbo.saveCounts @countTable, @P237    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@P1 varchar(2),@P2 varchar(1),@P3 varchar(1),@P4 varchar(1),@P5 varchar(2),@P6 varchar(2),@P7 varchar(2),@P8 varchar(1),@P9 varchar(2),@P10 varchar(1),@P11 varchar(1),@P12 varchar(1),@P13 varchar(2),@P14 varchar(3),@P15 varchar(3),@P16 varchar(1),@P17 varchar(2),@P18 varchar(1),@P19 varchar(1),@P20 varchar(1),@P21 varchar(2),@P22 varchar(1),@P23 varchar(1),@P24 varchar(1),@P25 varchar(2),@P26 varchar(1),@P27 varchar(1),@P28 varchar(1),@P29 varchar(2),@P30 varchar(1),@P31 varchar(1),@P32 varchar(1),@P33 varchar(2),@P34 varchar(1),@P35 varchar(3),@P36 varchar(1),@P37 varchar(2),@P38 varchar(1),@P39 varchar(1),@P40 varchar(1),@P41 varchar(2),@P42 varchar(1),@P43 varchar(1),@P44 varchar(1),@P45 varchar(2),@P46 varchar(3),@P47 varchar(3),@P48 varchar(1),@P49 varchar(2),@P50 varchar(1),@P51 varchar(1),@P52 varchar(1),@P53 varchar(2),@P54 varchar(1),@P55 varchar(1),@P56 varchar(1),@P57 varchar(2),@P58 varchar(2),@P59 varchar(2),@P60 varchar(1),@P61 varchar(2),@P62 varchar(2),@P63 varchar(1),@P64 varchar(1),@P65 va   </inputbuf>
  </process>
 </process-list>
 <resource-list>
  <keylock hobtid="72057594089504768" dbid="7" objectname="Andon.dbo.countLogLast" indexname="1" id="lock650ca4c00" mode="U" associatedObjectId="72057594089504768">
   <owner-list>
    <owner id="process276716558" mode="U" />
   </owner-list>
   <waiter-list>
    <waiter id="process8200f3868" mode="U" requestType="wait" />
   </waiter-list>
  </keylock>
  <keylock hobtid="72057594089504768" dbid="7" objectname="Andon.dbo.countLogLast" indexname="1" id="lock27baa6780" mode="X" associatedObjectId="72057594089504768">
   <owner-list>
    <owner id="process8200f3868" mode="X" />
   </owner-list>
   <waiter-list>
    <waiter id="process276716558" mode="U" requestType="wait" />
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

Here's what the actual query plan looks like for each of the two UPDATE queries in the deadlock, if I run them manually (in the same order as in the XML report above):

Given that every single query in the procedure only operates on the input set of machines (or a subset of it) and that the three instances of the script feed the procedure mutually exclusive sets of machines, how could there possibly be a deadlock? When I look at the deadlock report, I see key locks, not page or object locks.
When I look at the query plans, I see that countLogLast is never being scanned, only seeked, so I don't see why irrelevant rows would be locked. What am I failing to understand?

As far as solutions go, I'm leaning toward enabling snapshot isolation and using it for this procedure since I know that there aren't really any data conflicts (just locking conflicts), and therefore, snapshot should allow for perfect concurrency between these procedure instances without doing any harm. But it bothers me that I don't understand why this problem is happening in the first place, and I'm curious about other suggestions.

Update 1

Josh Darnell smartly pointed out in his answer below that the problem might be that a given plant is calling this procedure again before the prior call has finished. Below, I added an excerpt from the very beginning of this saveCounts procedure that should prevent that possibility, but maybe it is not doing so?

What it does is take a lock for each machine using sp_getapplock. The reason I did this is to prevent the one other source of updates to these tables (another stored procedure) from working with a given machine's data at the same time as this procedure. I would think that a bonus perk of this scheme would be that it would make additional instances the saveCounts procedure for a given plant wait to acquire the relevant machines' locks, preventing the deadlocks I'm seeing.

Given that this procedure has taken as long as 214 seconds to execute, that the script at each plant calls it every 60 seconds, and that I set a 20-second timeout on sp_getapplock, I would think that if two calls from the same plant were to overlap, I'd get sp_getapplock timeouts rather than these deadlocks. I assume my script is simply blocking until the procedure completes before looping, meaning that sporadic execution times of more than 60 seconds shouldn't matter.
The lack of timeouts seems to support that assumption, but I will look into it.

ALTER PROCEDURE dbo.saveCounts
    @counts type_countTableInput READONLY,

...

BEGIN TRANSACTION

    DECLARE
        @machineID INT,
        @lockName VARCHAR(25),
        @errorMsg VARCHAR(MAX),
        @result INT;

    DECLARE cursorMachine CURSOR FOR SELECT machineID FROM @counts;

    OPEN cursorMachine;
    FETCH NEXT FROM cursorMachine INTO @machineID;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @lockName = 'downtimeReasonLock' + CAST(@machineID AS VARCHAR(25))

        -- Attempt to take the lock.
        EXEC @result = sp_getapplock
                        @Resource = @lockName,
                        @LockMode = 'Exclusive',
                        @LockOwner = 'Transaction',
                        @LockTimeout = 20000,
                        @DbPrincipal = 'public'
        IF @result < 0  -- if there was a problem getting the lock
        BEGIN
            SELECT @errorMsg =
                'Error getting ' + @lockName +
                '.  sp_getapplock returned ' + CAST(@result AS VARCHAR(25)) + ': ' +
                CASE @result
                    --WHEN  1  THEN 'The lock was granted successfully after waiting for other incompatible locks to be released.' (I left this here to show the one other possible return value, but it's impossible to get it here due to the IF statement above.)
                    WHEN   -1  THEN 'The lock request timed out.'
                    WHEN   -2  THEN 'The lock request was canceled.'
                    WHEN   -3  THEN 'The lock request was chosen as a deadlock victim.'
                    WHEN -999  THEN 'Indicates a parameter validation or other call error.'
                    ELSE            'Unexpected return value'
                END
            RAISERROR(@errorMsg, 16, 1)
        END

        FETCH NEXT FROM cursorMachine INTO @machineID;
    END;

    CLOSE cursorMachine;
    DEALLOCATE cursorMachine;

Update 2

I verified that my Python script that calls this procedure simply waits until the procedure returns rather than possibly calling it again while it's still executing the past instance. I also gathered more diagnostic data from the script. I don't see any evidence that the script is somehow calling the procedure more than once for the same set of machines concurrently. Therefore, I'm back to my original assumption that the three instances of this procedure from different plants that are operating on non-overlapping sets of machines are somehow causing the deadlocks.

Update 3:

I enabled SNAPSHOT isolation and set this procedure to use it. Rather than seeing dozens of daily deadlocks, I have not yet seen a single deadlock as of several days later, so I think I'm good to go. The downside is that this procedure's performance is much worse than before:

Average worker time is 3x prior value
Average elapsed time is 3x prior value
Max elapsed time is 2.5x prior value: 600 seconds (I have no idea…)

Best Answer

Looking at the execution plans, it is hard to argue with your conclusion that concurrent executions of that procedure shouldn't deadlock. They should be taking out key locks, on different keys, in the countLogLatest table, like ships passing in the night.

One thing that might cause the deadlock you showed is if the procedure is running more than once concurrently for the same plant's data. You mentioned:

There is a script for each plant that calls this procedure once per minute...

If the script takes longer than a minute for some reason, then the next iteration could run and deadlock with the first. Or it might be as simple as there is a bug in the script scheduling at one of the plants, resulting in the script firing twice.

If this is the problem, the best solution is probably to either tune the proc so it runs in less than a minute, or prevent overlapping execution on the task scheduling side somehow.

One other item that is worth pointing out is that the queries that deadlocked are running under the READ UNCOMMITTED isolation level.

It's hard for me to imagine that causing problems with the specific execution plans you showed. However, since you mentioned this is part of a larger procedure, if the entire procedure is running at this level, then it is easy to imagine odd concurrency stuff coming into play leading up to these queries (populating the table variables, for instance).

Generally speaking, READ UNCOMMITTED kind of gives me the heebie jeebies. It might be worth changing that to at least READ COMMITTED to see if it makes your UPDATE deadlocks go away.

A common cause of deadlocks is processing the same data in different orders across multiple sessions. You should reduce the risk of this occurring by adding an ORDER BY clause to the cursor query:

DECLARE cursorMachine CURSOR FOR SELECT machineID FROM @counts ORDER BY machineID;

To your point, I would kind of expect the "out of order" processing to result in app lock timeouts based on your description. But adding the ORDER BY won't hurt, and could resolve the deadlocks.

By the way, based on the code you've shared, each session is "accumulating" app locks on machines one at a time until the transaction commits (because you're not calling sp_releaseapplock). Maybe this is intentional, or the real code does release the lock after processing that machine, but I thought I'd mention it.

Update 1

Update 2

Update 3:

Best Answer

Related Solutions

Sql-server – Shared and IX locks causing deadlock (Sql server)

Sql-server – SELECT/INSERT Deadlock

Related Question