Sql-server – SQL Deadlock on the same exclusively locked clustered Key (with NHibernate) on delete / insert

deadlocknhibernatesql server

I have been working on this deadlock issue for quite a few days now and no matter what I do, it persists in one way or another.

First, the general premise:
We have Visits with VisitItems in a one to many relationship.

VisitItems relevant info:

CREATE TABLE [BAR].[VisitItems] (
    [Id]                INT             IDENTITY (1, 1) NOT NULL,
    [VisitType]         INT             NOT NULL,
    [FeeRateType]       INT             NOT NULL,
    [Amount]            DECIMAL (18, 2) NOT NULL,
    [GST]               DECIMAL (18, 2) NOT NULL,
    [Quantity]          INT             NOT NULL,
    [Total]             DECIMAL (18, 2) NOT NULL,
    [ServiceFeeType]    INT   NOT NULL,
    [ServiceText]       NVARCHAR (200)  NULL,
    [InvoicingProviderId] INT   NULL,
    [FeeItemId]        INT             NOT NULL,
    [VisitId]          INT             NULL,
    [IsDefault] BIT NOT NULL DEFAULT 0, 
    [SourceVisitItemId] INT NULL, 
    [OverrideCode] INT NOT NULL DEFAULT 0, 
    [InvoiceToCentre] BIT NOT NULL DEFAULT 0, 
    [IsSurchargeItem] BIT NOT NULL DEFAULT 0, 
    CONSTRAINT [PK_BAR.VisitItems] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [FK_BAR.VisitItems_BAR.FeeItems_FeeItem_Id] FOREIGN KEY ([FeeItemId]) REFERENCES [BAR].[FeeItems] ([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.Visits_Visit_Id] FOREIGN KEY ([VisitId]) REFERENCES [BAR].[Visits] ([Id]), 
    CONSTRAINT [FK_BAR.VisitItems_BAR.VisitTypes] FOREIGN KEY ([VisitType]) REFERENCES [BAR].[VisitTypes]([Id]), 
    CONSTRAINT [FK_BAR.VisitItems_BAR.FeeRateTypes] FOREIGN KEY ([FeeRateType]) REFERENCES [BAR].[FeeRateTypes]([Id]),
    CONSTRAINT [FK_BAR.VisitItems_CMN.Users_Id] FOREIGN KEY (InvoicingProviderId) REFERENCES [CMN].[Users] ([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.VisitItems_SourceVisitItem_Id] FOREIGN KEY ([SourceVisitItemId]) REFERENCES [BAR].[VisitItems]([Id]),
    CONSTRAINT [CK_SourceVisitItemId_Not_Equal_Id] CHECK ([SourceVisitItemId] <> [Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.OverrideCodes] FOREIGN KEY ([OverrideCode]) REFERENCES [BAR].[OverrideCodes]([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.ServiceFeeTypes] FOREIGN KEY ([ServiceFeeType]) REFERENCES [BAR].[ServiceFeeTypes]([Id])
)

CREATE NONCLUSTERED INDEX [IX_FeeItem_Id]
    ON [BAR].[VisitItems]([FeeItemId] ASC)

CREATE NONCLUSTERED INDEX [IX_Visit_Id]
    ON [BAR].[VisitItems]([VisitId] ASC)

Visit info:

CREATE TABLE [BAR].[Visits] (
    [Id]                     INT            IDENTITY (1, 1) NOT NULL,
    [VisitType]              INT            NOT NULL,
    [DateOfService]          DATETIMEOFFSET  NOT NULL,
    [InvoiceAnnotation]      NVARCHAR(255)  NULL ,
    [PatientId]              INT            NOT NULL,
    [UserId]                 INT            NULL,
    [WorkAreaId]             INT            NOT NULL, 
    [DefaultItemOverride] BIT NOT NULL DEFAULT 0, 
    [DidNotWaitAdjustmentId] INT NULL, 
    [AppointmentId] INT NULL, 
    CONSTRAINT [PK_BAR.Visits] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [FK_BAR.Visits_CMN.Patients] FOREIGN KEY ([PatientId]) REFERENCES [CMN].[Patients] ([Id]) ON DELETE CASCADE,
    CONSTRAINT [FK_BAR.Visits_CMN.Users] FOREIGN KEY ([UserId]) REFERENCES [CMN].[Users] ([Id]),
    CONSTRAINT [FK_BAR.Visits_CMN.WorkAreas_WorkAreaId] FOREIGN KEY ([WorkAreaId]) REFERENCES [CMN].[WorkAreas] ([Id]), 
    CONSTRAINT [FK_BAR.Visits_BAR.VisitTypes] FOREIGN KEY ([VisitType]) REFERENCES [BAR].[VisitTypes]([Id]),
    CONSTRAINT [FK_BAR.Visits_BAR.Adjustments] FOREIGN KEY ([DidNotWaitAdjustmentId]) REFERENCES [BAR].[Adjustments]([Id]), 
);

CREATE NONCLUSTERED INDEX [IX_Visits_PatientId]
    ON [BAR].[Visits]([PatientId] ASC);

CREATE NONCLUSTERED INDEX [IX_Visits_UserId]
    ON [BAR].[Visits]([UserId] ASC);

CREATE NONCLUSTERED INDEX [IX_Visits_WorkAreaId]
    ON [BAR].[Visits]([WorkAreaId]);

Multiple users want to update the VisitItems table concurrently in the following way:

A separate web request will create a Visit with VisitItems (usually 1).
Then (the problem request):

Web request comes in, opens NHibernate session, starts NHibernate transaction (using Repeatable Read with READ_COMMITTED_SNAPSHOT on).
Read all the visit items for a given visit by VisitId.
Code assesses if the items are still relevant or if we need new ones using complex rules (so slightly long running, e.g. 40ms).
Code finds 1 item needs to be added, adds it using NHibernate Visit.VisitItems.Add(..)
Code identifies that one item needs to be deleted (not the one we just added), removes it using NHibernate Visit.VisitItems.Remove(item).
Code commits the transaction

With a tool I simulate 12 concurrent requests which is quite likely to happen in a future production environment.

[EDIT] Upon request, removed a lot of the investigation details I had added here to keep it short.

After lots of research the next step was to think of a way how I can lock hint on a different index to the one used in the where clause (i.e. the primary key, since that's used for deletion), so I altered my lock statement to:

var items = (List<VisitItem>)_session.CreateSQLQuery(@"SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
        WHERE VisitId = :visitId")
        .AddEntity(typeof(VisitItem))
        .SetParameter("visitId", qi.Visit.Id)
        .List<VisitItem>();

This reduced the deadlocks in frequency slightly, but they were still happening. And here is where I'm starting to get lost:

<deadlock-list>
  <deadlock victim="process3f71e64e8">
    <process-list>
      <process id="process3f71e64e8" taskpriority="0" logused="0" waitresource="KEY: 5:72057594071744512 (a5e1814e40ba)" waittime="3812" ownerId="8004520" transactionname="user_transaction" lasttranstarted="2015-12-14T10:24:58.010" XDES="0x3f7cb43b0" lockMode="X" schedulerid="1" kpid="15788" status="suspended" spid="63" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2015-12-14T10:24:58.013" lastbatchcompleted="2015-12-14T10:24:58.013" lastattention="1900-01-01T00:00:00.013" clientapp=".Net SqlClient Data Provider" hostname="ABC" hostpid="10016" loginname="bsapp" isolationlevel="repeatable read (3)" xactid="8004520" currentdb="5" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="18" stmtend="254" sqlhandle="0x0200000024a9e43033ef90bb631938f939038627209baafb0000000000000000000000000000000000000000">
            unknown
          </frame>
          <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
            unknown
          </frame>
        </executionStack>
        <inputbuf>
          (@p0 int)SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
          WHERE VisitId = @p0
        </inputbuf>
      </process>
      <process id="process4105af468" taskpriority="0" logused="1824" waitresource="KEY: 5:72057594071744512 (8194443284a0)" waittime="3792" ownerId="8004519" transactionname="user_transaction" lasttranstarted="2015-12-14T10:24:58.010" XDES="0x3f02ea3b0" lockMode="S" schedulerid="8" kpid="15116" status="suspended" spid="65" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2015-12-14T10:24:58.033" lastbatchcompleted="2015-12-14T10:24:58.033" lastattention="1900-01-01T00:00:00.033" clientapp=".Net SqlClient Data Provider" hostname="ABC" hostpid="10016" loginname="bsapp" isolationlevel="repeatable read (3)" xactid="8004519" currentdb="5" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="18" stmtend="98" sqlhandle="0x0200000075abb0074bade5aa57b8357410941428df4d54130000000000000000000000000000000000000000">
            unknown
          </frame>
          <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
            unknown
          </frame>
        </executionStack>
        <inputbuf>
          (@p0 int)DELETE FROM BAR.VisitItems WHERE Id = @p0
        </inputbuf>
      </process>
    </process-list>
    <resource-list>
      <keylock hobtid="72057594071744512" dbid="5" objectname="BAR.VisitItems" indexname="PK_BAR.VisitItems" id="lock449e27500" mode="X" associatedObjectId="72057594071744512">
        <owner-list>
          <owner id="process4105af468" mode="X"/>
        </owner-list>
        <waiter-list>
          <waiter id="process3f71e64e8" mode="X" requestType="wait"/>
        </waiter-list>
      </keylock>
      <keylock hobtid="72057594071744512" dbid="5" objectname="BAR.VisitItems" indexname="PK_BAR.VisitItems" id="lock46a525080" mode="X" associatedObjectId="72057594071744512">
        <owner-list>
          <owner id="process3f71e64e8" mode="X"/>
        </owner-list>
        <waiter-list>
          <waiter id="process4105af468" mode="S" requestType="wait"/>
        </waiter-list>
      </keylock>
    </resource-list>
  </deadlock>
</deadlock-list>

A trace of the resulting number of queries looks like this.
[EDIT] Whoa. What a week. I have now updated the trace with the unredacted trace of the relevant statement that I think lead to the deadlock.

exec sp_executesql N'SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
                WHERE VisitId = @p0',N'@p0 int',@p0=3826
go
exec sp_executesql N'SELECT visititems0_.VisitId as VisitId1_, visititems0_.Id as Id1_, visititems0_.Id as Id37_0_, visititems0_.VisitType as VisitType37_0_, visititems0_.FeeItemId as FeeItemId37_0_, visititems0_.FeeRateType as FeeRateT4_37_0_, visititems0_.Amount as Amount37_0_, visititems0_.GST as GST37_0_, visititems0_.Quantity as Quantity37_0_, visititems0_.Total as Total37_0_, visititems0_.ServiceFeeType as ServiceF9_37_0_, visititems0_.ServiceText as Service10_37_0_, visititems0_.InvoiceToCentre as Invoice11_37_0_, visititems0_.IsDefault as IsDefault37_0_, visititems0_.OverrideCode as Overrid13_37_0_, visititems0_.IsSurchargeItem as IsSurch14_37_0_, visititems0_.VisitId as VisitId37_0_, visititems0_.InvoicingProviderId as Invoici16_37_0_, visititems0_.SourceVisitItemId as SourceV17_37_0_ FROM BAR.VisitItems visititems0_ WHERE visititems0_.VisitId=@p0',N'@p0 int',@p0=3826
go
exec sp_executesql N'INSERT INTO BAR.VisitItems (VisitType, FeeItemId, FeeRateType, Amount, GST, Quantity, Total, ServiceFeeType, ServiceText, InvoiceToCentre, IsDefault, OverrideCode, IsSurchargeItem, VisitId, InvoicingProviderId, SourceVisitItemId) VALUES (@p0, @p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8, @p9, @p10, @p11, @p12, @p13, @p14, @p15); select SCOPE_IDENTITY()',N'@p0 int,@p1 int,@p2 int,@p3 decimal(28,5),@p4 decimal(28,5),@p5 int,@p6 decimal(28,5),@p7 int,@p8 nvarchar(4000),@p9 bit,@p10 bit,@p11 int,@p12 bit,@p13 int,@p14 int,@p15 int',@p0=1,@p1=452,@p2=1,@p3=0,@p4=0,@p5=1,@p6=0,@p7=1,@p8=NULL,@p9=0,@p10=1,@p11=0,@p12=0,@p13=3826,@p14=3535,@p15=NULL
go
exec sp_executesql N'UPDATE BAR.Visits SET VisitType = @p0, DateOfService = @p1, InvoiceAnnotation = @p2, DefaultItemOverride = @p3, AppointmentId = @p4, ReferralRequired = @p5, ReferralCarePlan = @p6, UserId = @p7, PatientId = @p8, WorkAreaId = @p9, DidNotWaitAdjustmentId = @p10, ReferralId = @p11 WHERE Id = @p12',N'@p0 int,@p1 datetimeoffset(7),@p2 nvarchar(4000),@p3 bit,@p4 int,@p5 bit,@p6 nvarchar(4000),@p7 int,@p8 int,@p9 int,@p10 int,@p11 int,@p12 int',@p0=1,@p1='2016-01-22 12:37:06.8915296 +08:00',@p2=NULL,@p3=0,@p4=NULL,@p5=0,@p6=NULL,@p7=3535,@p8=4246,@p9=2741,@p10=NULL,@p11=NULL,@p12=3826
go
exec sp_executesql N'DELETE FROM BAR.VisitItems WHERE Id = @p0',N'@p0 int',@p0=7919
go

Now my lock seems to have an effect since it's showing in the deadlock graph.
But what? Three exclusive locks and one shared lock? How does that work on the same object / key? I thought as long as you have an exclusive lock, you can't get a shared lock from somebody else? And the other way around. If you have a shared lock, nobody can gain exclusive lock, they have to wait.

I think I am lacking some deeper understanding here on how the locks work when they are taken on multiple keys on the same table.

Here are some of the things I have tried and their impact:

Added another index hint on IX_Visit_Id to the lock statement. No
change
Added a second column to the IX_Visit_Id (the Id of the
VisitItem column); far fetched, but tried anyway. No change
Changed Isolation level back to read committed (default in our project),
deadlocks still happening
Changed Isolation level to serializable.
Deadlocks still happening, but worse (different graphs). I don't
really want to do that, anyway.
Taking a table lock makes them go away (obviously), but who would want to do that?
Taking a pessimistic application lock (using sp_getapplock) works, but that's pretty much the same thing as the table lock, don't want to do that.
Adding the READPAST hint to the XLOCK hint made no difference
I have turned off PageLock on the index and PK, no difference
I have added ROWLOCK hint to the XLOCK hint, made no difference

Some side note on NHibernate:
The way it is used and I understand it works is that it caches the sql statements until it really finds it necessary to execute them, unless you call flush, which we're trying not to do. So most of the statements (e.g. the lazily loaded Aggregate list of VisitItems => Visit.VisitItems) are executed only when necessary. Most of the actual update and delete statements from my transaction get executed at the end when the transaction is committed (as is evident from the sql trace above). I really have no control over the execution order; NHibernate decides when to do what. My initial lock statement is really only a work-around.

Also, with the lock statement, I'm just reading the items into an unused list (I'm not trying to override the VisitItems list on the Visit object since that's not how NHibernate is supposed to work as far as I can tell).
So even though I read the list first with the custom statement, NHibernate will still load the list again into its proxy object collection Visit.VisitItems using a separate sql call that I can see in the trace when it's time to lazily load it somewhere.

But that shouldn't matter, right? I already have the lock on said key? Loading it again won't change that?

As a final note, maybe to clarify: Each process is adding its own Visit with VisitItems first, then goes in and modifies it (which will trigger the delete and insert and the deadlock). In my tests, there is never any process changing the exact same Visit or VisitItems.

Does anybody have an idea on how to approach this any further? Anything I can try to get around this in an smart way (no table locks etc)? Also, I would like to learn why this tripple-x lock is even possible on the same object. I don't understand.

Please let me know if any more information is required to solve the puzzle.

[EDIT]
I updated the question with the DDL for the two tables involved.

Also I was asked for clarification on the expectation:
Yes, a few deadlocks here and there are ok, we'll just retry or get the user to re-submit (generally speaking). But at the current frequency with 12 concurrent users, I would expect there to only be one every few hours at most. Currently they pop up multiple times per minute.

In addition to that, I got some more information on the trancount=2, which might indicate a problem with nested transactions, which we're not really using. I will investigate that, too, and document the results here.

Best Answer

I made a couple comments to this effect, but I'm not sure you're getting the desired results when you combine Repeatable Read transaction isolation level with Read Committed Snapshot.

The TIL reported in your deadlock list is repeatable read, which is even more restrictive than Read Committed, and given the flow you describe, is likely leading to deadlocks.

What you might be trying to do is to have your DB TIL remain repeatable read, but set the transaction to use the snapshot TIL explicitly with a set transaction isolation level statement. Reference: https://msdn.microsoft.com/en-us/library/ms173763.aspx If so, I think you must have something incorrect. I'm not familiar with nHibernate, but it looks like there's a reference here: http://www.anujvarma.com/fluent-nhibernate-setting-database-transaction-isolation-level/

If your app's architecture will allow it, an option would be to try read committed snapshot at the db level, and if you still get deadlocks, enable snapshot with row versioning. NOTE that, if you do this, you need to re-think your tempdb setup if you enable snapshot (row versioning). I can get you all sorts of material on this if you need it - let me know.

Related Solutions

Sql-server – Shared and IX locks causing deadlock (Sql server)

You'll want to change your code to something like this.

DECLARE @AssetIssueId INT
INSERT INTO [Issue] ([IssueId], [AssetId]) VALUES (?, ?)
SET @AssetIssueID = scope_identity()
INSERT INTO [IssuePropertyValues] 
@AssetIssueID, ?, ?

As to the why...

If Java is starting a transaction for you automatically, and the insert of 10k+ rows is taking a few moments, then this thread is taking an IX on the Issue table. Another thread is then taking an IX on the issue table (which is blocked). This thread is then attempting to take an S on the issue table (for the INSERT SELECT FROM statement) which is blocked waiting for the other thread, and we now have a deadlock.

Sql-server – Why are these two INSERTs deadlocking? Is it the trigger? What does this deadlock trace 1222 log tell me

I didn't get a chance to rerun the trace and capture the more nicely formatted XDL file, but in the meantime I was able to resolve the issue by removing all the isolation level and locking hints from the trigger and modifying the application code to run the INSERT itself from within a serializable transaction.

The app uses LINQ to SQL, so the INSERT statement is dynamically generated and run under the default READ COMMITTED isolation level. From the deadlock trace, I could tell that the locks taken by the INSERT were conflicting with the locks taken by the trigger. The deadlock was happening even after removing all the isolation and locking hints from the trigger, so I surmised that the trigger's UPDATE statements were causing stronger locks than the INSERT statement. And when the competing simultaneous transaction tried to do the same thing, the deadlock occurred.

We were definitely on the right track with our previous attempted solutions: using higher isolation levels and stronger locking in the trigger. The problem is that the weaker locks were already taken by the INSERT statement generated by the LINQ to SQL, before the trigger ever got called. The solution is to explicitly start a transaction in our .NET code using SERALIZABLE isolation level, do the insert in that transaction, let the trigger happen in the same transaction (without altering the isolation level or lock hints) and then commit and dispose the transaction. When I did that, I was no longer able to reproduce the issue (I even put it back to the old way and verified that the issue returned.) Here is the new code:

            ' Do the insert in a serializable transaction to prevent simultaneous inserts from deadlocking
            ' due to the trigger, which reaches out to other rows.
            Me.packageContext.Connection.Open()
            Try
                Using oTransaction = Me.packageContext.Connection.BeginTransaction(IsolationLevel.Serializable)
                    Me.packageContext.Transaction = oTransaction
                    Try
                        Me.packageContext.SubmitChanges()
                        oTransaction.Commit()
                    Catch ex As Exception
                        oTransaction.Rollback()
                        Throw
                    End Try
                End Using
            Finally
                ' We need to close the connection and null out the transaction to allow
                ' subsequent uses of the same DataContext to work properly outside this transaction.
                Me.packageContext.Connection.Close()
                Me.packageContext.Transaction = Nothing
            End Try

Normally we avoid opening transactions from within client side code, because they can run long if the code does remote API calls, causing excessive blocking. But in this case we are isolating a very small bit of code, and we have robust error/rollback handling around it.

An alternate solution would be to implement stored procedures that do the INSERTs within explicit SERIALIZABLE transactions, and then alter the DataContext to call the stored procedure. However, that solution is less maintainable as it requires updating both the stored procedure and the DataContext definition every time the schema changes.

Finally, another potential alternate solution would be to examine the deadlock log in detail and find a way to adjust our indexes or keys to avoid the deadlock in the first place. However, that is more difficult and may not be possible with our design.

I think the solution above is the simplest and most understandable for future developers.

Best Answer

Related Solutions

Sql-server – Shared and IX locks causing deadlock (Sql server)

Sql-server – Why are these two INSERTs deadlocking? Is it the trigger? What does this deadlock trace 1222 log tell me

Related Question