SQL Server – avoid TABLOCKX on DELETE

Tags: delete, locking, sql-server, sql-server-2016

Goal:

Delete several billion old records (roughly 600GB) across three tables with zero downtime.

Methodology and Background:

I plan to delete batches of records associated with 1 million MyIDs at a time (i.e. MyID BETWEEN 1 AND 1000000). While the delete is running, two of the tables will be very hot, but activity on the third table can be safely suspended. In the two hot tables, MyID is the clustering key. In the third, cold, table I have a non-clustered index on MyID.

Besides the DELETE operation, the other activity in the hot tables will consist of INSERTs, probably several per second. MyID is an IDENTITY column, and the newly inserted records will not be in the range of any of the DELETE batches.

If it helps to gauge potential performance: a SELECT of an aggregate over a sample batch of these rows takes under 1 second on the tables clustered by MyID and about 2 seconds on the non-clustered table. I do not have a prod replica of this particular database to play in, so I cannot say how long the deletes will take; once I deploy this to DEV I'll have a better idea.

Side note: to address log size, I am wrapping these batches in transactions and increasing the transaction log backup frequency from every 15 minutes to every 5 minutes. I have 150GB of available log space.
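For illustration, a minimal sketch of the loop I have in mind, with placeholder names and bounds (dbo.MyHotTable and @MaxOldMyID are made up for the example):

-- Sketch of the planned batch loop; table name and bounds are placeholders.
DECLARE @BatchStart bigint = 1,
        @BatchSize  bigint = 1000000,
        @MaxOldMyID bigint = 2000000000;  -- last MyID eligible for deletion

WHILE @BatchStart <= @MaxOldMyID
BEGIN
    BEGIN TRANSACTION;

    DELETE FROM dbo.MyHotTable
    WHERE MyID BETWEEN @BatchStart AND @BatchStart + @BatchSize - 1;

    COMMIT TRANSACTION;

    SET @BatchStart += @BatchSize;
END;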

The Question

I read in the MS docs:

By default, a DELETE statement always acquires an exclusive (X) lock on the table it modifies, and holds that lock until the transaction completes.

I am very wary of using locking hints, but in this case can I safely use one to avoid TABLOCKX? Are there other ways besides a locking hint to avoid TABLOCKX?

Best Answer

Delete several billion old records (roughly 600GB)

In this case it may be more convenient to copy the rows that should remain rather than delete the old ones. Note that with a DELETE, all 600GB will go through your log file.
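A rough sketch of that approach, with placeholder names (dbo.MyTable, dbo.MyTable_New, the column list, and @LastOldMyID are all made up for the example); note that the swap step itself needs a brief pause in writes:

-- Sketch only; all names and values below are placeholders.
DECLARE @LastOldMyID bigint = 1000000000;  -- highest MyID to be purged

-- 1. Create an empty copy (columns only; indexes and constraints
--    must be scripted separately before the swap).
SELECT *
INTO dbo.MyTable_New
FROM dbo.MyTable
WHERE 1 = 0;

-- 2. Copy only the rows that should remain. Because MyID is an IDENTITY,
--    IDENTITY_INSERT and an explicit column list are required. With TABLOCK
--    this insert can be minimally logged under SIMPLE or BULK_LOGGED recovery.
SET IDENTITY_INSERT dbo.MyTable_New ON;

INSERT INTO dbo.MyTable_New WITH (TABLOCK) (MyID, Col1, Col2)
SELECT MyID, Col1, Col2
FROM dbo.MyTable
WHERE MyID > @LastOldMyID;

SET IDENTITY_INSERT dbo.MyTable_New OFF;

-- 3. Swap the tables during a quiet moment.
BEGIN TRANSACTION;
EXEC sp_rename 'dbo.MyTable', 'MyTable_Old';
EXEC sp_rename 'dbo.MyTable_New', 'MyTable';
COMMIT TRANSACTION;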

By default, a DELETE statement always acquires an exclusive (X) lock on the table it modifies, and holds that lock until the transaction completes.

This is wrong. SQL Server never starts with a table lock unless the corresponding hint is specified.

By default, SQL Server will acquire the finest-grain lock possible, in order to attain the greatest concurrency. In most cases, this means SQL Server will acquire row (RID or KEY) locks. SQL Server can acquire hundreds or thousands of individual locks on data in a single table without causing any problems. In some cases, however, if SQL Server determines that a query will access a range of rows within a clustered index, it may instead acquire page locks. After all, if every row on a page is going to be accessed, it's easier to manage a single page lock than dozens, or hundreds, of row locks. In other cases, primarily when there is no usable index to help process a query, SQL Server may lock an entire table right at the beginning of processing a query.

In your case, with filtering on PK ranges (MyID BETWEEN 1 AND 1000000), it will most probably start with page locks. Everything then depends on how many records fit on one page. If locking 1,000,000 records at page granularity takes fewer than 5,000 locks, you'll be fine with page locks; if a statement acquires more locks than that, lock escalation will happen and your table will be completely locked.

...

In addition to escalating locks when an instance-wide threshold is crossed, SQL Server will also escalate locks when any individual session acquires more than 5,000 locks in a single statement. In this case, there is no randomness in choosing which session will get its locks escalated; it is the session that acquired the locks.

From SQL Server Concurrency: Locking, Blocking and Row Versioning by Kalen Delaney
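While testing a batch, you can watch how many locks the session actually holds with a query like the one below (the session_id value 53 is a placeholder for the session running the test DELETE):

-- Count locks held by the delete session, grouped by type and mode;
-- replace 53 with the real session_id.
SELECT resource_type,
       request_mode,
       COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 53
GROUP BY resource_type, request_mode
ORDER BY lock_count DESC;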

So only by testing and monitoring the locks acquired can you find the optimal batch size and avoid lock escalation. You can approximately calculate your batch size if you know the average number of rows per page in your table. Alternatively, you can force page locking with the PAGLOCK hint, and even disable lock escalation on the table:

ALTER TABLE MyTable SET ( LOCK_ESCALATION = DISABLE )
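For the batch size calculation, a rough average of rows per page can be read from sys.dm_db_partition_stats; a sketch, assuming the clustered table is named dbo.MyHotTable:

-- Approximate rows per page for the heap or clustered index (index_id 0 or 1).
SELECT SUM(ps.row_count) AS total_rows,
       SUM(ps.in_row_data_page_count) AS data_pages,
       SUM(ps.row_count) * 1.0
           / NULLIF(SUM(ps.in_row_data_page_count), 0) AS avg_rows_per_page
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.MyHotTable')
  AND ps.index_id IN (0, 1);

Multiplying avg_rows_per_page by a number safely under 5,000 gives a batch size that should stay at page granularity, and a hint such as DELETE FROM dbo.MyHotTable WITH (PAGLOCK) WHERE ... keeps the starting granularity predictable.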