I must choose between three possible DELETE methods in SQL Server. The RawData table stays at around 500 GB.
Here is the table definition:
CREATE TABLE dbo.RawData (
    rowId INT IDENTITY,
    AreaId INT,
    MeasureId INT,
    someData UNIQUEIDENTIFIER DEFAULT NEWID(),
    DateEnergy DATETIME DEFAULT GETDATE(),
    addedBy VARCHAR(30) DEFAULT SUSER_NAME(),
    ts ROWVERSION,
    CONSTRAINT [PK_RawData] PRIMARY KEY CLUSTERED ([AreaId] ASC, [MeasureId] ASC, [DateEnergy] ASC)
)
GO
CREATE INDEX IX_RawData_AreaId
ON RawData (AreaId);
GO
CREATE NONCLUSTERED INDEX [IX_RawData_DateEnergy]
ON [dbo].[RawData] ([DateEnergy])
INCLUDE ([AreaId],[MeasureId])
GO
CREATE NONCLUSTERED INDEX [IX_RawData_MeasureId_DateEnergy]
ON [dbo].[RawData] ([MeasureId],[DateEnergy])
INCLUDE ([AreaId])
Method 1: Add an IsDeleted column (BIT, since SQL Server has no boolean type).
- Would require indexing this column as well, so SELECTs can exclude deleted rows
- Requires updating the flag (update = row lock)?
Method 2: Add a new table that stores the rows to be deleted.
- The SELECT must outer-join the two tables
- No update necessary on RawData
Method 3: Use an existing column to mark rows for deletion.
- Update AreaId to a negative value to mark the row (update = row lock)?
- Reuse the existing purge job that removes rows in batches of 10,000
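For reference, the batched purge mentioned in Method 3 is typically a loop like the following minimal sketch; the 10,000 batch size comes from the description above, and the `WHERE AreaId < 0` predicate assumes Method 3's negative-AreaId marker.

```sql
-- Sketch: delete marked rows in small batches to keep the
-- transaction log and lock footprint bounded.
WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.RawData
    WHERE AreaId < 0;   -- rows previously marked for deletion

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to purge
END
```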
Constraints:
The deletion may need to be cancelled at any time.
Questions:
- Which method would be the most efficient?
- Are there any other methods, not mentioned here, that might be better?
Update July 11th 2019
The number of rows to be deleted depends on their collection date, so there could be millions, hundreds, or none.
Best Answer
The best solution to this problem is almost always Method 2, but implemented as two tables combined with UNION ALL rather than a LEFT JOIN (one table holds the deleted rows, one the active rows). There are several reasons why this is superior:
The two-table solution can be implemented with an INSTEAD OF DELETE trigger that moves the rows into the deleted table instead of physically deleting them. Moving rows back is equally easy, which satisfies the requirement that a deletion can be cancelled at any time.
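A minimal sketch of such a trigger, assuming the active rows live in dbo.RawData_Active and the deleted rows in dbo.RawData_Deleted (both table names are illustrative, with the same columns as RawData minus the identity and rowversion columns):

```sql
CREATE TRIGGER trg_RawData_Active_Delete
ON dbo.RawData_Active
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Move the rows into the deleted-rows table...
    INSERT INTO dbo.RawData_Deleted (AreaId, MeasureId, DateEnergy, someData, addedBy)
    SELECT AreaId, MeasureId, DateEnergy, someData, addedBy
    FROM deleted;

    -- ...then physically remove them from the active table,
    -- matching on the clustered key.
    DELETE a
    FROM dbo.RawData_Active AS a
    JOIN deleted AS d
      ON  a.AreaId = d.AreaId
      AND a.MeasureId = d.MeasureId
      AND a.DateEnergy = d.DateEnergy;
END
GO
```

A plain `DELETE FROM dbo.RawData_Active WHERE ...` then archives the rows instead of destroying them, and cancelling the deletion is just an INSERT...SELECT in the opposite direction.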
The only downside of solution 2 is that you will need to modify your queries to distinguish between the deleted and active rows and tables. This can be done via a view, but it is safer to avoid one if possible: a view can confuse the optimiser, and there are cases where querying a view instead of a table gives you horrible execution plans. If you do use a view, you should add an `IsDeleted` column to both tables and put a check constraint on it. This ensures the optimiser will not try to seek BOTH tables for every query (which would double your IOPS).
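A sketch of that view, again assuming the illustrative table names dbo.RawData_Active and dbo.RawData_Deleted: the CHECK constraints on `IsDeleted` let the optimiser eliminate the irrelevant branch of the UNION ALL when a query filters on the flag.

```sql
-- Each table carries a constrained IsDeleted flag.
ALTER TABLE dbo.RawData_Active
    ADD IsDeleted BIT NOT NULL CONSTRAINT DF_Active_IsDeleted DEFAULT 0,
    CONSTRAINT CK_Active_IsDeleted CHECK (IsDeleted = 0);

ALTER TABLE dbo.RawData_Deleted
    ADD IsDeleted BIT NOT NULL CONSTRAINT DF_Deleted_IsDeleted DEFAULT 1,
    CONSTRAINT CK_Deleted_IsDeleted CHECK (IsDeleted = 1);
GO

CREATE VIEW dbo.RawData_All AS
SELECT AreaId, MeasureId, DateEnergy, someData, addedBy, IsDeleted
FROM dbo.RawData_Active
UNION ALL
SELECT AreaId, MeasureId, DateEnergy, someData, addedBy, IsDeleted
FROM dbo.RawData_Deleted;
GO

-- Thanks to the CHECK constraints, this touches only the active table:
SELECT * FROM dbo.RawData_All WHERE IsDeleted = 0;
```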
For completeness, and to answer your last question, there are some other ways to "solve" this:
Method 4: Add an IsDeleted column and partition on this column.
- Drawback: it does not allow separate statistics on the two partitions (statistics are per table, not per partition, in SQL Server)
Method 5: Use a filtered index to apply different indexing (and secondary storage) to deleted and active rows.
- Drawback: good luck getting the optimiser to behave properly with this method.
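To illustrate Method 5, a filtered-index sketch assuming a single-table design with an IsDeleted BIT column (as in Method 1); the index name is hypothetical.

```sql
-- Index only the active rows; deleted rows consume no space here.
CREATE NONCLUSTERED INDEX IX_RawData_Active_DateEnergy
ON dbo.RawData (DateEnergy)
INCLUDE (AreaId, MeasureId)
WHERE IsDeleted = 0;
```

The catch is that the optimiser only considers this index when the query predicate provably matches the filter (e.g. a literal `WHERE IsDeleted = 0`; a parameterised `WHERE IsDeleted = @flag` typically will not qualify), which is why plans built on filtered indexes can be fragile.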