SQL Server Optimization – Most Efficient DELETE Method

optimization, sql-server

I must choose between three possible DELETE methods in SQL Server. The table, RawData, stays at around 500 GB.

Here is the table:

CREATE TABLE dbo.RawData (
    rowId      INT IDENTITY,
    AreaId     INT NOT NULL,
    MeasureId  INT NOT NULL,
    someData   UNIQUEIDENTIFIER DEFAULT NEWID(),
    DateEnergy DATETIME NOT NULL DEFAULT GETDATE(),
    addedBy    VARCHAR(30) DEFAULT SUSER_NAME(),
    ts         ROWVERSION,
    CONSTRAINT [PK_RawData] PRIMARY KEY CLUSTERED ([AreaId] ASC, [MeasureId] ASC, [DateEnergy] ASC)
);
GO

CREATE INDEX IX_RawData_AreaId
  ON RawData (AreaId);
GO

CREATE NONCLUSTERED INDEX [IX_RawData_DateEnergy]
ON [dbo].[RawData] ([DateEnergy])
INCLUDE ([AreaId],[MeasureId]);
GO

CREATE NONCLUSTERED INDEX [IX_RawData_MeasureId_DateEnergy]
ON [dbo].[RawData] ([MeasureId],[DateEnergy])
INCLUDE ([AreaId]);
GO

Method 1: Add an IsDeleted column of type BIT (boolean).

  • Would require that this column be indexed as well (so the SELECT can
    exclude deleted rows)
  • Update this flag column to mark rows for deletion (update = row lock)?
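
Roughly what Method 1 would look like (the IsDeleted name, the index, and the literal values below are placeholders, not part of the existing schema):

ALTER TABLE dbo.RawData
    ADD IsDeleted BIT NOT NULL CONSTRAINT DF_RawData_IsDeleted DEFAULT (0);
GO

-- Index the flag so SELECTs can exclude deleted rows
CREATE NONCLUSTERED INDEX IX_RawData_IsDeleted
    ON dbo.RawData (IsDeleted);
GO

-- Mark rows for deletion; cancelling is simply the reverse UPDATE
UPDATE dbo.RawData
SET    IsDeleted = 1
WHERE  AreaId = 42                   -- placeholder values
  AND  DateEnergy < '20190101';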

Method 2: Add a new table that will store rows that need to be deleted

  • Outer join between the two tables for the SELECT
  • No update necessary on RawData
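
Roughly what Method 2 would look like (the table name RowsToDelete is a placeholder, keyed on the same columns as the clustered key of RawData):

CREATE TABLE dbo.RowsToDelete (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    CONSTRAINT PK_RowsToDelete PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);
GO

-- SELECTs exclude the marked rows with an anti-join; cancelling a deletion
-- is just a DELETE from RowsToDelete
SELECT r.AreaId, r.MeasureId, r.DateEnergy, r.someData
FROM   dbo.RawData AS r
LEFT JOIN dbo.RowsToDelete AS d
       ON  d.AreaId     = r.AreaId
       AND d.MeasureId  = r.MeasureId
       AND d.DateEnergy = r.DateEnergy
WHERE  d.AreaId IS NULL;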

Method 3: Use an existing column to detect rows to be deleted

  • Update the AreaId to a negative value to mark the row for deletion
    (update = row lock)?
  • Use the existing deletion process that removes rows in batches of
    10,000
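
Roughly what Method 3 would look like (the negative AreaId convention and the 10,000-row batches are from above; the loop shape and literal values are placeholders):

-- Mark: flip AreaId negative to flag rows for deletion (cancel = flip it back)
UPDATE dbo.RawData
SET    AreaId = -AreaId
WHERE  AreaId = 42                   -- placeholder values
  AND  DateEnergy < '20190101';

-- Purge: the existing job deletes flagged rows 10,000 at a time
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM  dbo.RawData
    WHERE AreaId < 0;

    SET @rows = @@ROWCOUNT;
END;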

Constraints:

The deletion might need to be cancelled at any given time

Questions:

  1. Which method would be the most efficient?

  2. Are there any other methods not mentioned that might be better?

Update July 11th 2019

The number of rows being deleted depends on their collection date, so there could be millions of rows, a few hundred, or none.

Best Answer

The best solution to this problem is nearly always Method 2, but with two tables combined via UNION ALL rather than a LEFT JOIN (one table to hold the deleted rows, one for the active rows). There are several reasons why this is superior:

  1. You can maintain separate statistics on active and deleted rows. Assumption: over time you will have more deleted rows than active ones. This means that the "active rows" table stays small, so full scans are cheaper. It also makes things like index rebuilds (if you do them) faster on the active rows.
  2. Statistics on old rows will not affect new rows. This helps you avoid skew problems.
  3. Each table can live on its own filegroup. This means you can move deleted rows off to cheaper storage.
  4. You can have different indexing/partitioning strategies on the deleted rows vs. the active ones. For example, you may choose to use a columnstore index on the old rows if they are read via scans often but changed very rarely.
  5. The deleted table can be taken offline with a table SWITCH (for maintenance) without disrupting the system too much.

The two-table solution can be implemented with an INSTEAD OF DELETE trigger that moves the rows to the deleted table instead of deleting them.
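
A sketch of what I mean, using hypothetical names (RawData_Active, RawData_Deleted) and a trimmed column list for brevity:

CREATE TABLE dbo.RawData_Active (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    someData   UNIQUEIDENTIFIER,
    CONSTRAINT PK_RawData_Active PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);

CREATE TABLE dbo.RawData_Deleted (
    AreaId     INT      NOT NULL,
    MeasureId  INT      NOT NULL,
    DateEnergy DATETIME NOT NULL,
    someData   UNIQUEIDENTIFIER,
    CONSTRAINT PK_RawData_Deleted PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy)
);
GO

-- INSTEAD OF DELETE: move the rows to the deleted table rather than discarding them
-- (the DELETE inside the trigger does not fire the trigger again)
CREATE TRIGGER trg_RawData_Active_Delete
ON dbo.RawData_Active
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.RawData_Deleted (AreaId, MeasureId, DateEnergy, someData)
    SELECT AreaId, MeasureId, DateEnergy, someData
    FROM   deleted;

    DELETE a
    FROM   dbo.RawData_Active AS a
    JOIN   deleted            AS d
      ON   d.AreaId     = a.AreaId
      AND  d.MeasureId  = a.MeasureId
      AND  d.DateEnergy = a.DateEnergy;
END;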

The only downside of solution 2 is that you will need to modify your queries to distinguish between deleted and non-deleted rows and tables. This can be done via a view, but it is safer to avoid that if possible. The view can confuse the optimiser, and there are cases where using a view instead of a table gives you horrible execution plans. If you do use a view, you should add an `IsDeleted` column to both tables and put a check constraint on it. This ensures that the optimiser will not try to seek BOTH tables for every query (which would double your IOPS).
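
For reference, a sketch of that check-constraint arrangement, reusing the hypothetical table names from the trigger sketch above (the view name RawData_All is also made up):

-- Each table carries the flag, locked to a single value by a check constraint
ALTER TABLE dbo.RawData_Active
    ADD IsDeleted BIT NOT NULL
        CONSTRAINT DF_RawData_Active_IsDeleted DEFAULT (0)
        CONSTRAINT CK_RawData_Active_IsDeleted CHECK (IsDeleted = 0);

ALTER TABLE dbo.RawData_Deleted
    ADD IsDeleted BIT NOT NULL
        CONSTRAINT DF_RawData_Deleted_IsDeleted DEFAULT (1)
        CONSTRAINT CK_RawData_Deleted_IsDeleted CHECK (IsDeleted = 1);
GO

CREATE VIEW dbo.RawData_All
AS
SELECT AreaId, MeasureId, DateEnergy, someData, IsDeleted FROM dbo.RawData_Active
UNION ALL
SELECT AreaId, MeasureId, DateEnergy, someData, IsDeleted FROM dbo.RawData_Deleted;
GO

-- With the trusted check constraints in place, the optimiser can prune the
-- deleted branch, so this should only have to touch RawData_Active
SELECT AreaId, MeasureId, DateEnergy
FROM   dbo.RawData_All
WHERE  IsDeleted = 0
  AND  MeasureId = 7;                -- placeholder value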

For completeness, and to answer your last question, there are some other ways to "solve" this:

Method 4: Add an IsDeleted column and partition on this column. Drawback: Does not allow separate statistics on the two partitions (statistics are table, not partition based, in SQL Server)
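
A sketch of Method 4, assuming an IsDeleted BIT column has been added to dbo.RawData (the function and scheme names are made up):

CREATE PARTITION FUNCTION pf_IsDeleted (BIT)
    AS RANGE RIGHT FOR VALUES (1);   -- partition 1: IsDeleted = 0, partition 2: IsDeleted = 1

CREATE PARTITION SCHEME ps_IsDeleted
    AS PARTITION pf_IsDeleted ALL TO ([PRIMARY]);
GO

-- The table would then have to be rebuilt on the scheme, with IsDeleted added
-- to the clustered key, along the lines of:
--   CONSTRAINT PK_RawData PRIMARY KEY CLUSTERED (AreaId, MeasureId, DateEnergy, IsDeleted)
--       ON ps_IsDeleted (IsDeleted)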

Method 5: Use a filtered index to apply different indexing (and secondary storage) on deleted and active rows. Drawback: Good luck getting the optimiser to behave properly with this method.
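
And a sketch of Method 5, again assuming an IsDeleted flag on dbo.RawData (the index name is made up):

-- Indexes only the active rows; deleted rows take no space in this index
CREATE NONCLUSTERED INDEX IX_RawData_MeasureId_DateEnergy_Active
    ON dbo.RawData (MeasureId, DateEnergy)
    INCLUDE (AreaId)
    WHERE IsDeleted = 0;

-- The optimiser only picks the filtered index when the predicate clearly
-- implies the filter (parameterised queries often defeat it)
SELECT AreaId, MeasureId, DateEnergy
FROM   dbo.RawData
WHERE  MeasureId = 7                 -- placeholder value
  AND  IsDeleted = 0;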