SQL Server Optimization – Most Efficient DELETE Method

optimization, sql-server

I must choose between three possible DELETE methods in SQL Server. The table, RawData, stays at around 500 GB.

Here is the table:

CREATE TABLE dbo.RawData (
    rowId      INT IDENTITY,
    AreaId     INT NOT NULL,
    MeasureId  INT NOT NULL,
    someData   UNIQUEIDENTIFIER DEFAULT NEWID(),
    DateEnergy DATETIME NOT NULL DEFAULT GETDATE(),
    addedBy    VARCHAR(30) DEFAULT SUSER_NAME(),
    ts         ROWVERSION,
    CONSTRAINT [PK_RawData] PRIMARY KEY CLUSTERED ([AreaId] ASC, [MeasureId] ASC, [DateEnergy] ASC)
);
GO

CREATE INDEX IX_RawData_AreaId
  ON RawData (AreaId);
GO

CREATE NONCLUSTERED INDEX [IX_RawData_DateEnergy]
ON [dbo].[RawData] ([DateEnergy])
INCLUDE ([AreaId],[MeasureId]);
GO

CREATE NONCLUSTERED INDEX [IX_RawData_MeasureId_DateEnergy]
ON [dbo].[RawData] ([MeasureId],[DateEnergy])
INCLUDE ([AreaId]);
GO

Method 1: Add an IsDeleted column of type BIT (boolean).

  • Would require that this column be indexed as well (so the SELECT can
    exclude deleted rows)
  • Update this flag column to mark rows for deletion (update = row lock)?
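
Roughly what Method 1 would look like (the IsDeleted name, the index, and the literal values below are placeholders, not part of the existing schema):

ALTER TABLE dbo.RawData
    ADD IsDeleted BIT NOT NULL CONSTRAINT DF_RawData_IsDeleted DEFAULT (0);
GO

-- Index the flag so SELECTs can exclude deleted rows
CREATE NONCLUSTERED INDEX IX_RawData_IsDeleted
    ON dbo.RawData (IsDeleted);
GO

-- Mark rows for deletion; cancelling is simply the reverse UPDATE
UPDATE dbo.RawData
SET    IsDeleted = 1
WHERE  AreaId = 42                   -- placeholder values
  AND  DateEnergy < '20190101';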

Method 2: Add a new table that will store rows that need to be deleted

  • Outer join between the two tables for the SELECT
  • No update necessary on RawData
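
Roughly what Method 2 would look like (the table name RowsToDelete is a placeholder, keyed on the same columns as the clustered key of RawData):

CREATE TABLE dbo.RowsToDelete (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    CONSTRAINT PK_RowsToDelete PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);
GO

-- SELECTs exclude the marked rows with an anti-join; cancelling a deletion
-- is just a DELETE from RowsToDelete
SELECT r.AreaId, r.MeasureId, r.DateEnergy, r.someData
FROM   dbo.RawData AS r
LEFT JOIN dbo.RowsToDelete AS d
       ON  d.AreaId     = r.AreaId
       AND d.MeasureId  = r.MeasureId
       AND d.DateEnergy = r.DateEnergy
WHERE  d.AreaId IS NULL;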

Method 3: Use an existing column to detect rows to be deleted

  • Update the AreaId to a negative value to mark the row for deletion
    (update = row lock)?
  • Use the existing deletion process that removes rows in batches of
    10,000
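
Roughly what Method 3 would look like (the negative AreaId convention and the 10,000-row batches are from above; the loop shape and literal values are placeholders):

-- Mark: flip AreaId negative to flag rows for deletion (cancel = flip it back)
UPDATE dbo.RawData
SET    AreaId = -AreaId
WHERE  AreaId = 42                   -- placeholder values
  AND  DateEnergy < '20190101';

-- Purge: the existing job deletes flagged rows 10,000 at a time
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM  dbo.RawData
    WHERE AreaId < 0;

    SET @rows = @@ROWCOUNT;
END;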

Constraints:

The deletion might need to be cancelled at any given time

Questions:

  1. Which method would be the most efficient?

  2. Are there any other methods not mentioned that might be better?

Update July 11th 2019

The number of rows being deleted depends on their collection date, so there could be millions of rows, a few hundred, or none.

Best Answer

The best solution to this problem is nearly always Method 2, but with two tables combined via UNION ALL rather than a LEFT JOIN (one table to hold the deleted rows, one for the active rows). There are several reasons why this is superior:

  1. You can maintain separate statistics on active and deleted rows. Assumption: over time you will have more deleted rows than active ones. This means that the "active rows" table stays small, so full scans are cheaper. It also makes things like index rebuilds (if you do them) faster on the active rows.
  2. Statistics on old rows will not affect new rows. This helps you avoid skew problems.
  3. Each table can live on its own filegroup. This means you can move deleted rows off to cheaper storage.
  4. You can have different indexing/partitioning strategies on the deleted rows vs. the active ones. For example, you may choose to use a columnstore index on the old rows if they are read via scans often but changed very rarely.
  5. The deleted table can be taken offline with a table SWITCH (for maintenance) without disrupting the system too much.

The two-table solution can be implemented with an INSTEAD OF DELETE trigger that moves the rows to the deleted table instead of deleting them.
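
A sketch of what I mean, using hypothetical names (RawData_Active, RawData_Deleted) and a trimmed column list for brevity:

CREATE TABLE dbo.RawData_Active (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    someData   UNIQUEIDENTIFIER,
    CONSTRAINT PK_RawData_Active PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);

CREATE TABLE dbo.RawData_Deleted (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    someData   UNIQUEIDENTIFIER,
    CONSTRAINT PK_RawData_Deleted PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);
GO

-- INSTEAD OF DELETE: move the rows to the deleted table rather than discarding them
-- (the DELETE inside the trigger does not fire the trigger again)
CREATE TRIGGER trg_RawData_Active_Delete
ON dbo.RawData_Active
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.RawData_Deleted (AreaId, MeasureId, DateEnergy, someData)
    SELECT AreaId, MeasureId, DateEnergy, someData
    FROM   deleted;

    DELETE a
    FROM   dbo.RawData_Active AS a
    JOIN   deleted            AS d
      ON   d.AreaId     = a.AreaId
      AND  d.MeasureId  = a.MeasureId
      AND  d.DateEnergy = a.DateEnergy;
END;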

The only downside of solution 2 is that you will need to modify your queries to distinguish between deleted and non-deleted rows and tables. This can be done via a view, but it is safer to avoid that if possible. The view can confuse the optimiser, and there are cases where using a view instead of a table gives you horrible execution plans. If you do use a view, you should add an `IsDeleted` column to both tables and put a check constraint on it. This ensures that the optimiser will not try to seek BOTH tables for every query (which would double your IOPS).
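
For reference, a sketch of that check-constraint arrangement, reusing the hypothetical table names from the trigger sketch above (the view name RawData_All is also made up):

-- Each table carries the flag, locked to a single value by a check constraint
ALTER TABLE dbo.RawData_Active
    ADD IsDeleted BIT NOT NULL
        CONSTRAINT DF_RawData_Active_IsDeleted DEFAULT (0)
        CONSTRAINT CK_RawData_Active_IsDeleted CHECK (IsDeleted = 0);

ALTER TABLE dbo.RawData_Deleted
    ADD IsDeleted BIT NOT NULL
        CONSTRAINT DF_RawData_Deleted_IsDeleted DEFAULT (1)
        CONSTRAINT CK_RawData_Deleted_IsDeleted CHECK (IsDeleted = 1);
GO

CREATE VIEW dbo.RawData_All
AS
SELECT AreaId, MeasureId, DateEnergy, someData, IsDeleted FROM dbo.RawData_Active
UNION ALL
SELECT AreaId, MeasureId, DateEnergy, someData, IsDeleted FROM dbo.RawData_Deleted;
GO

-- With the trusted check constraints in place, the optimiser can prune the
-- deleted branch, so this should only have to touch RawData_Active
SELECT AreaId, MeasureId, DateEnergy
FROM   dbo.RawData_All
WHERE  IsDeleted = 0
  AND  MeasureId = 7;                -- placeholder value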

For completeness, and to answer your last question, there are some other ways to "solve" this:

Method 4: Add an IsDeleted column and partition on this column. Drawback: Does not allow separate statistics on the two partitions (statistics are table, not partition based, in SQL Server)
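
A sketch of Method 4, assuming an IsDeleted BIT column has been added to dbo.RawData (the function and scheme names are made up):

CREATE PARTITION FUNCTION pf_IsDeleted (BIT)
    AS RANGE RIGHT FOR VALUES (1);   -- partition 1: IsDeleted = 0, partition 2: IsDeleted = 1

CREATE PARTITION SCHEME ps_IsDeleted
    AS PARTITION pf_IsDeleted ALL TO ([PRIMARY]);
GO

-- The table would then have to be rebuilt on the scheme, with IsDeleted added
-- to the clustered key, along the lines of:
--   CONSTRAINT PK_RawData PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy, IsDeleted)
--       ON ps_IsDeleted (IsDeleted)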

Method 5: Use a filtered index to apply different indexing (and secondary storage) on deleted and active rows. Drawback: Good luck getting the optimiser to behave properly with this method.
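
And a sketch of Method 5, again assuming an IsDeleted flag on dbo.RawData (the index name is made up):

-- Indexes only the active rows; deleted rows take no space in this index
CREATE NONCLUSTERED INDEX IX_RawData_MeasureId_DateEnergy_Active
    ON dbo.RawData (MeasureId, DateEnergy)
    INCLUDE (AreaId)
    WHERE IsDeleted = 0;

-- The optimiser only picks the filtered index when the predicate clearly
-- implies the filter (parameterised queries often defeat it)
SELECT AreaId, MeasureId, DateEnergy
FROM   dbo.RawData
WHERE  MeasureId = 7                 -- placeholder value
  AND  IsDeleted = 0;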