SQL Server – DELETE from Subquery or Derived Table

sql server

Is there a way I can turn the below SELECT statement into a DELETE?

I would like to delete the corresponding returned records from the [ETL].[Stage_Claims] table.

Since I used derived tables I can't reference the Stage_Claims table.

To summarize, the 2 physical tables used in the below query have identical structures. Only difference is DUPS_Claims is a subset of Stage_Claims.

DUPS_Claims contains duplicate records found in Stage_Claims. If a record exists 3 times in Stage_Claims, we will have that record 3 times in DUPS_Claims as well.

Stage_Claims contains all records including the duplicate records in DUPS_Claims.

I would like to remove the duplicate records from Stage_Claims leaving only 1 unique record for every duplicated record.

Stage_Claims has just short of 1 million rows so I do not want to use Row_Number / Partition on the entire table as it takes over 2 minutes to run.

The below query I have runs in about 15 seconds and successfully identifies only the duplicate records (not including the original unique record we want to keep) but I have not been able to figure out how to delete the records that are returned from SC.

Is it possible or should I just take a different approach?

SELECT *
FROM (
    SELECT RN = ROW_NUMBER() OVER (
            PARTITION BY SC.ID ORDER BY SC.id
            )
        ,SC.*
    FROM [ETL].[Stage_Claims] SC
    WHERE ID IN (
            SELECT ID
            FROM (
                SELECT RN = ROW_NUMBER() OVER (
                        PARTITION BY ID ORDER BY id
                        )
                    ,ID
                FROM [ETL].[DUPS_Claims]
                ) AS t1
            WHERE RN > 1
            )
    ) AS t2
WHERE RN > 1

Best Answer

Convert your select statement into a CTE, and DELETE FROM the CTE, as in:

;WITH del AS
(
SELECT *
FROM (
    SELECT RN = ROW_NUMBER() OVER (
            PARTITION BY SC.ID ORDER BY SC.id
            )
        ,SC.*
    FROM [ETL].[Stage_Claims] SC
    WHERE ID IN (
            SELECT ID
            FROM (
                SELECT RN = ROW_NUMBER() OVER (
                        PARTITION BY ID ORDER BY id
                        )
                    ,ID
                FROM [ETL].[DUPS_Claims]
                ) AS t1
            WHERE RN > 1
            )
    ) AS t2
WHERE RN > 1
)
DELETE FROM del;

Standard warning: You should test this in a non-production environment.

You can simplify your query quite a bit, and likely get better performance by using the below query, which does not make use of the intermediate table, DUPS_Claims, since it is absolutely unnecessary:

;WITH cte AS
(
    SELECT sc.ID
        , rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
    FROM ETL.Stage_Claims sc
)
DELETE
FROM cte 
WHERE rn > 1;

I created a non-clustered, non-unique index on both tables, then looked at the execution plans for both variations.

The first variant:

The second variant:

The first variant scans the index twice, whereas the second variant clearly only needs to scan the index a single time, and doesn't require a relatively expensive merge join in my somewhat contrived example. My sample ETL.Stage_Claims table contains 89 unique ID values, each duplicated 89 times, for a total of 7921 rows.

In case CTEs aren't your thing, you can use this approach to delete from a derived table, instead:

DELETE c
FROM (
    SELECT sc.ID
        , rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
    FROM ETL.Stage_Claims sc
) c
WHERE rn > 1;

The query plan for the above DELETE from the derived table:

Related Solutions

Sql-server – Adding Row Version to SQL Server table

In such cases I am storing version numbers, and using referential integrity to make sure version numbers begin with 1 and have no gaps. Because I use constraints, I am sure my solution works in high concurrency situations. I have described this approach in this article

SQL Server – Understanding Semaphore Usage

Generally your current code shouldn't allow two concurrent transactions to process the same @Sbs_Type simultaneously.

Two concurrent threads can't both acquire U locks on the rows matching WHERE Sbs_Trf_Type = @Sbs_Type.

If two threads simultaneously call GetSubscribeNo with the same @Sbs_Type when there are no matching rows whilst meanwhile a third transaction inserts a row for that @Sbs_Type I can see how this would fail however.

One or both of the GetSubscribeNo calls would need to get past the UPDATE statement before the insert and thus fail to lock anything that would block the other one.

A better way of doing this anyway would be with the OUTPUT clause.

ALTER PROCEDURE [dbo].[GetSubscribeNo] @Sbs_Type    INT,
                                       @subscribeNo INT OUTPUT
AS
  BEGIN
      DECLARE @Out TABLE (Subscribe_No INT);

      SET XACT_ABORT ON;
      BEGIN TRAN T1;

      WITH T
           AS ((SELECT TOP(1) *
                FROM   UNIT_SBS_NOS WITH (ROWLOCK, READPAST)
                WHERE  Sbs_Trf_Type = @Sbs_Type))
      DELETE FROM T
      OUTPUT DELETED.FREE_SBS_NO INTO @Out;

      /*Default to -1 if no rows*/
      SELECT @subscribeNo = ISNULL(MAX(Subscribe_No), -1)
      FROM   @Out

      INSERT INTO All_Subscribe
                  (Subscribe_No)
      VALUES      (@subscribeNo);

      COMMIT TRAN T1;
  END

Best Answer

Related Solutions

Sql-server – Adding Row Version to SQL Server table

SQL Server – Understanding Semaphore Usage

Related Question