Sql-server – SQL Deadlock on the same exclusively locked clustered Key (with NHibernate) on delete / insert

deadlocknhibernatesql server

I have been working on this deadlock issue for quite a few days now and no matter what I do, it persists in one way or another.

First, the general premise:
We have Visits with VisitItems in a one to many relationship.

VisitItems relevant info:

CREATE TABLE [BAR].[VisitItems] (
    [Id]                INT             IDENTITY (1, 1) NOT NULL,
    [VisitType]         INT             NOT NULL,
    [FeeRateType]       INT             NOT NULL,
    [Amount]            DECIMAL (18, 2) NOT NULL,
    [GST]               DECIMAL (18, 2) NOT NULL,
    [Quantity]          INT             NOT NULL,
    [Total]             DECIMAL (18, 2) NOT NULL,
    [ServiceFeeType]    INT   NOT NULL,
    [ServiceText]       NVARCHAR (200)  NULL,
    [InvoicingProviderId] INT   NULL,
    [FeeItemId]        INT             NOT NULL,
    [VisitId]          INT             NULL,
    [IsDefault] BIT NOT NULL DEFAULT 0, 
    [SourceVisitItemId] INT NULL, 
    [OverrideCode] INT NOT NULL DEFAULT 0, 
    [InvoiceToCentre] BIT NOT NULL DEFAULT 0, 
    [IsSurchargeItem] BIT NOT NULL DEFAULT 0, 
    CONSTRAINT [PK_BAR.VisitItems] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [FK_BAR.VisitItems_BAR.FeeItems_FeeItem_Id] FOREIGN KEY ([FeeItemId]) REFERENCES [BAR].[FeeItems] ([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.Visits_Visit_Id] FOREIGN KEY ([VisitId]) REFERENCES [BAR].[Visits] ([Id]), 
    CONSTRAINT [FK_BAR.VisitItems_BAR.VisitTypes] FOREIGN KEY ([VisitType]) REFERENCES [BAR].[VisitTypes]([Id]), 
    CONSTRAINT [FK_BAR.VisitItems_BAR.FeeRateTypes] FOREIGN KEY ([FeeRateType]) REFERENCES [BAR].[FeeRateTypes]([Id]),
    CONSTRAINT [FK_BAR.VisitItems_CMN.Users_Id] FOREIGN KEY (InvoicingProviderId) REFERENCES [CMN].[Users] ([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.VisitItems_SourceVisitItem_Id] FOREIGN KEY ([SourceVisitItemId]) REFERENCES [BAR].[VisitItems]([Id]),
    CONSTRAINT [CK_SourceVisitItemId_Not_Equal_Id] CHECK ([SourceVisitItemId] <> [Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.OverrideCodes] FOREIGN KEY ([OverrideCode]) REFERENCES [BAR].[OverrideCodes]([Id]),
    CONSTRAINT [FK_BAR.VisitItems_BAR.ServiceFeeTypes] FOREIGN KEY ([ServiceFeeType]) REFERENCES [BAR].[ServiceFeeTypes]([Id])
)

CREATE NONCLUSTERED INDEX [IX_FeeItem_Id]
    ON [BAR].[VisitItems]([FeeItemId] ASC)

CREATE NONCLUSTERED INDEX [IX_Visit_Id]
    ON [BAR].[VisitItems]([VisitId] ASC)

Visit info:

CREATE TABLE [BAR].[Visits] (
    [Id]                     INT            IDENTITY (1, 1) NOT NULL,
    [VisitType]              INT            NOT NULL,
    [DateOfService]          DATETIMEOFFSET  NOT NULL,
    [InvoiceAnnotation]      NVARCHAR(255)  NULL ,
    [PatientId]              INT            NOT NULL,
    [UserId]                 INT            NULL,
    [WorkAreaId]             INT            NOT NULL, 
    [DefaultItemOverride] BIT NOT NULL DEFAULT 0, 
    [DidNotWaitAdjustmentId] INT NULL, 
    [AppointmentId] INT NULL, 
    CONSTRAINT [PK_BAR.Visits] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [FK_BAR.Visits_CMN.Patients] FOREIGN KEY ([PatientId]) REFERENCES [CMN].[Patients] ([Id]) ON DELETE CASCADE,
    CONSTRAINT [FK_BAR.Visits_CMN.Users] FOREIGN KEY ([UserId]) REFERENCES [CMN].[Users] ([Id]),
    CONSTRAINT [FK_BAR.Visits_CMN.WorkAreas_WorkAreaId] FOREIGN KEY ([WorkAreaId]) REFERENCES [CMN].[WorkAreas] ([Id]), 
    CONSTRAINT [FK_BAR.Visits_BAR.VisitTypes] FOREIGN KEY ([VisitType]) REFERENCES [BAR].[VisitTypes]([Id]),
    CONSTRAINT [FK_BAR.Visits_BAR.Adjustments] FOREIGN KEY ([DidNotWaitAdjustmentId]) REFERENCES [BAR].[Adjustments]([Id]), 
);

CREATE NONCLUSTERED INDEX [IX_Visits_PatientId]
    ON [BAR].[Visits]([PatientId] ASC);

CREATE NONCLUSTERED INDEX [IX_Visits_UserId]
    ON [BAR].[Visits]([UserId] ASC);

CREATE NONCLUSTERED INDEX [IX_Visits_WorkAreaId]
    ON [BAR].[Visits]([WorkAreaId]);

Multiple users want to update the VisitItems table concurrently in the following way:

A separate web request will create a Visit with VisitItems (usually 1).
Then (the problem request):

  1. Web request comes in, opens NHibernate session, starts NHibernate transaction (using Repeatable Read with READ_COMMITTED_SNAPSHOT on).
  2. Read all the visit items for a given visit by VisitId.
  3. Code assesses if the items are still relevant or if we need new ones using complex rules (so slightly long running, e.g. 40ms).
  4. Code finds 1 item needs to be added, adds it using NHibernate Visit.VisitItems.Add(..)
  5. Code identifies that one item needs to be deleted (not the one we just added), removes it using NHibernate Visit.VisitItems.Remove(item).
  6. Code commits the transaction

With a tool I simulate 12 concurrent requests which is quite likely to happen in a future production environment.

[EDIT] Upon request, removed a lot of the investigation details I had added here to keep it short.

After lots of research the next step was to think of a way how I can lock hint on a different index to the one used in the where clause (i.e. the primary key, since that's used for deletion), so I altered my lock statement to:

var items = (List<VisitItem>)_session.CreateSQLQuery(@"SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
        WHERE VisitId = :visitId")
        .AddEntity(typeof(VisitItem))
        .SetParameter("visitId", qi.Visit.Id)
        .List<VisitItem>();

This reduced the deadlocks in frequency slightly, but they were still happening. And here is where I'm starting to get lost:

Three exclusive locks?

<deadlock-list>
  <deadlock victim="process3f71e64e8">
    <process-list>
      <process id="process3f71e64e8" taskpriority="0" logused="0" waitresource="KEY: 5:72057594071744512 (a5e1814e40ba)" waittime="3812" ownerId="8004520" transactionname="user_transaction" lasttranstarted="2015-12-14T10:24:58.010" XDES="0x3f7cb43b0" lockMode="X" schedulerid="1" kpid="15788" status="suspended" spid="63" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2015-12-14T10:24:58.013" lastbatchcompleted="2015-12-14T10:24:58.013" lastattention="1900-01-01T00:00:00.013" clientapp=".Net SqlClient Data Provider" hostname="ABC" hostpid="10016" loginname="bsapp" isolationlevel="repeatable read (3)" xactid="8004520" currentdb="5" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="18" stmtend="254" sqlhandle="0x0200000024a9e43033ef90bb631938f939038627209baafb0000000000000000000000000000000000000000">
            unknown
          </frame>
          <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
            unknown
          </frame>
        </executionStack>
        <inputbuf>
          (@p0 int)SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
          WHERE VisitId = @p0
        </inputbuf>
      </process>
      <process id="process4105af468" taskpriority="0" logused="1824" waitresource="KEY: 5:72057594071744512 (8194443284a0)" waittime="3792" ownerId="8004519" transactionname="user_transaction" lasttranstarted="2015-12-14T10:24:58.010" XDES="0x3f02ea3b0" lockMode="S" schedulerid="8" kpid="15116" status="suspended" spid="65" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2015-12-14T10:24:58.033" lastbatchcompleted="2015-12-14T10:24:58.033" lastattention="1900-01-01T00:00:00.033" clientapp=".Net SqlClient Data Provider" hostname="ABC" hostpid="10016" loginname="bsapp" isolationlevel="repeatable read (3)" xactid="8004519" currentdb="5" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="18" stmtend="98" sqlhandle="0x0200000075abb0074bade5aa57b8357410941428df4d54130000000000000000000000000000000000000000">
            unknown
          </frame>
          <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
            unknown
          </frame>
        </executionStack>
        <inputbuf>
          (@p0 int)DELETE FROM BAR.VisitItems WHERE Id = @p0
        </inputbuf>
      </process>
    </process-list>
    <resource-list>
      <keylock hobtid="72057594071744512" dbid="5" objectname="BAR.VisitItems" indexname="PK_BAR.VisitItems" id="lock449e27500" mode="X" associatedObjectId="72057594071744512">
        <owner-list>
          <owner id="process4105af468" mode="X"/>
        </owner-list>
        <waiter-list>
          <waiter id="process3f71e64e8" mode="X" requestType="wait"/>
        </waiter-list>
      </keylock>
      <keylock hobtid="72057594071744512" dbid="5" objectname="BAR.VisitItems" indexname="PK_BAR.VisitItems" id="lock46a525080" mode="X" associatedObjectId="72057594071744512">
        <owner-list>
          <owner id="process3f71e64e8" mode="X"/>
        </owner-list>
        <waiter-list>
          <waiter id="process4105af468" mode="S" requestType="wait"/>
        </waiter-list>
      </keylock>
    </resource-list>
  </deadlock>
</deadlock-list>

A trace of the resulting number of queries looks like this.
[EDIT] Whoa. What a week. I have now updated the trace with the unredacted trace of the relevant statement that I think lead to the deadlock.

exec sp_executesql N'SELECT * FROM BAR.VisitItems WITH (XLOCK, INDEX([PK_BAR.VisitItems]))
                WHERE VisitId = @p0',N'@p0 int',@p0=3826
go
exec sp_executesql N'SELECT visititems0_.VisitId as VisitId1_, visititems0_.Id as Id1_, visititems0_.Id as Id37_0_, visititems0_.VisitType as VisitType37_0_, visititems0_.FeeItemId as FeeItemId37_0_, visititems0_.FeeRateType as FeeRateT4_37_0_, visititems0_.Amount as Amount37_0_, visititems0_.GST as GST37_0_, visititems0_.Quantity as Quantity37_0_, visititems0_.Total as Total37_0_, visititems0_.ServiceFeeType as ServiceF9_37_0_, visititems0_.ServiceText as Service10_37_0_, visititems0_.InvoiceToCentre as Invoice11_37_0_, visititems0_.IsDefault as IsDefault37_0_, visititems0_.OverrideCode as Overrid13_37_0_, visititems0_.IsSurchargeItem as IsSurch14_37_0_, visititems0_.VisitId as VisitId37_0_, visititems0_.InvoicingProviderId as Invoici16_37_0_, visititems0_.SourceVisitItemId as SourceV17_37_0_ FROM BAR.VisitItems visititems0_ WHERE visititems0_.VisitId=@p0',N'@p0 int',@p0=3826
go
exec sp_executesql N'INSERT INTO BAR.VisitItems (VisitType, FeeItemId, FeeRateType, Amount, GST, Quantity, Total, ServiceFeeType, ServiceText, InvoiceToCentre, IsDefault, OverrideCode, IsSurchargeItem, VisitId, InvoicingProviderId, SourceVisitItemId) VALUES (@p0, @p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8, @p9, @p10, @p11, @p12, @p13, @p14, @p15); select SCOPE_IDENTITY()',N'@p0 int,@p1 int,@p2 int,@p3 decimal(28,5),@p4 decimal(28,5),@p5 int,@p6 decimal(28,5),@p7 int,@p8 nvarchar(4000),@p9 bit,@p10 bit,@p11 int,@p12 bit,@p13 int,@p14 int,@p15 int',@p0=1,@p1=452,@p2=1,@p3=0,@p4=0,@p5=1,@p6=0,@p7=1,@p8=NULL,@p9=0,@p10=1,@p11=0,@p12=0,@p13=3826,@p14=3535,@p15=NULL
go
exec sp_executesql N'UPDATE BAR.Visits SET VisitType = @p0, DateOfService = @p1, InvoiceAnnotation = @p2, DefaultItemOverride = @p3, AppointmentId = @p4, ReferralRequired = @p5, ReferralCarePlan = @p6, UserId = @p7, PatientId = @p8, WorkAreaId = @p9, DidNotWaitAdjustmentId = @p10, ReferralId = @p11 WHERE Id = @p12',N'@p0 int,@p1 datetimeoffset(7),@p2 nvarchar(4000),@p3 bit,@p4 int,@p5 bit,@p6 nvarchar(4000),@p7 int,@p8 int,@p9 int,@p10 int,@p11 int,@p12 int',@p0=1,@p1='2016-01-22 12:37:06.8915296 +08:00',@p2=NULL,@p3=0,@p4=NULL,@p5=0,@p6=NULL,@p7=3535,@p8=4246,@p9=2741,@p10=NULL,@p11=NULL,@p12=3826
go
exec sp_executesql N'DELETE FROM BAR.VisitItems WHERE Id = @p0',N'@p0 int',@p0=7919
go

Now my lock seems to have an effect since it's showing in the deadlock graph.
But what? Three exclusive locks and one shared lock? How does that work on the same object / key? I thought as long as you have an exclusive lock, you can't get a shared lock from somebody else? And the other way around. If you have a shared lock, nobody can gain exclusive lock, they have to wait.

I think I am lacking some deeper understanding here on how the locks work when they are taken on multiple keys on the same table.

Here are some of the things I have tried and their impact:

  • Added another index hint on IX_Visit_Id to the lock statement. No
    change
  • Added a second column to the IX_Visit_Id (the Id of the
    VisitItem column); far fetched, but tried anyway. No change
  • Changed Isolation level back to read committed (default in our project),
    deadlocks still happening
  • Changed Isolation level to serializable.
    Deadlocks still happening, but worse (different graphs). I don't
    really want to do that, anyway.
  • Taking a table lock makes them go away (obviously), but who would want to do that?
  • Taking a pessimistic application lock (using sp_getapplock) works, but that's pretty much the same thing as the table lock, don't want to do that.
  • Adding the READPAST hint to the XLOCK hint made no difference
  • I have turned off PageLock on the index and PK, no difference
  • I have added ROWLOCK hint to the XLOCK hint, made no difference

Some side note on NHibernate:
The way it is used and I understand it works is that it caches the sql statements until it really finds it necessary to execute them, unless you call flush, which we're trying not to do. So most of the statements (e.g. the lazily loaded Aggregate list of VisitItems => Visit.VisitItems) are executed only when necessary. Most of the actual update and delete statements from my transaction get executed at the end when the transaction is committed (as is evident from the sql trace above). I really have no control over the execution order; NHibernate decides when to do what. My initial lock statement is really only a work-around.

Also, with the lock statement, I'm just reading the items into an unused list (I'm not trying to override the VisitItems list on the Visit object since that's not how NHibernate is supposed to work as far as I can tell).
So even though I read the list first with the custom statement, NHibernate will still load the list again into its proxy object collection Visit.VisitItems using a separate sql call that I can see in the trace when it's time to lazily load it somewhere.

But that shouldn't matter, right? I already have the lock on said key? Loading it again won't change that?

As a final note, maybe to clarify: Each process is adding its own Visit with VisitItems first, then goes in and modifies it (which will trigger the delete and insert and the deadlock). In my tests, there is never any process changing the exact same Visit or VisitItems.

Does anybody have an idea on how to approach this any further? Anything I can try to get around this in an smart way (no table locks etc)? Also, I would like to learn why this tripple-x lock is even possible on the same object. I don't understand.

Please let me know if any more information is required to solve the puzzle.

[EDIT]
I updated the question with the DDL for the two tables involved.

Also I was asked for clarification on the expectation:
Yes, a few deadlocks here and there are ok, we'll just retry or get the user to re-submit (generally speaking). But at the current frequency with 12 concurrent users, I would expect there to only be one every few hours at most. Currently they pop up multiple times per minute.

In addition to that, I got some more information on the trancount=2, which might indicate a problem with nested transactions, which we're not really using. I will investigate that, too, and document the results here.

Best Answer

I made a couple comments to this effect, but I'm not sure you're getting the desired results when you combine Repeatable Read transaction isolation level with Read Committed Snapshot.

The TIL reported in your deadlock list is repeatable read, which is even more restrictive than Read Committed, and given the flow you describe, is likely leading to deadlocks.

What you might be trying to do is to have your DB TIL remain repeatable read, but set the transaction to use the snapshot TIL explicitly with a set transaction isolation level statement. Reference: https://msdn.microsoft.com/en-us/library/ms173763.aspx If so, I think you must have something incorrect. I'm not familiar with nHibernate, but it looks like there's a reference here: http://www.anujvarma.com/fluent-nhibernate-setting-database-transaction-isolation-level/

If your app's architecture will allow it, an option would be to try read committed snapshot at the db level, and if you still get deadlocks, enable snapshot with row versioning. NOTE that, if you do this, you need to re-think your tempdb setup if you enable snapshot (row versioning). I can get you all sorts of material on this if you need it - let me know.