Convert your select statement into a CTE, and DELETE FROM
the CTE, as in:
;WITH del AS
(
SELECT *
FROM (
SELECT RN = ROW_NUMBER() OVER (
PARTITION BY SC.ID ORDER BY SC.id
)
,SC.*
FROM [ETL].[Stage_Claims] SC
WHERE ID IN (
SELECT ID
FROM (
SELECT RN = ROW_NUMBER() OVER (
PARTITION BY ID ORDER BY id
)
,ID
FROM [ETL].[DUPS_Claims]
) AS t1
WHERE RN > 1
)
) AS t2
WHERE RN > 1
)
DELETE FROM del;
Standard warning: You should test this in a non-production environment.
You can simplify your query quite a bit, and likely get better performance by using the below query, which does not make use of the intermediate table, DUPS_Claims
, since it is absolutely unnecessary:
;WITH cte AS
(
SELECT sc.ID
, rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
FROM ETL.Stage_Claims sc
)
DELETE
FROM cte
WHERE rn > 1;
I created a non-clustered, non-unique index on both tables, then looked at the execution plans for both variations.
The first variant:
The second variant:
The first variant scans the index twice, whereas the second variant clearly only needs to scan the index a single time, and doesn't require a relatively expensive merge join in my somewhat contrived example. My sample ETL.Stage_Claims
table contains 89 unique ID
values, each duplicated 89 times, for a total of 7921 rows.
In case CTEs aren't your thing, you can use this approach to delete from a derived table, instead:
DELETE c
FROM (
SELECT sc.ID
, rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
FROM ETL.Stage_Claims sc
) c
WHERE rn > 1;
The query plan for the above DELETE
from the derived table:
I have a confession to make.
Once, when I was young, I built an ETL process started with changing read-only filegroups to read-write, doing its ETL work, and then setting them back to read-only.
So just in case you have a coworker who was diabolical like I was (I was young, I needed the money), you can test by:
Change the name of the read-only filegroup - that way, if someone has hard-coded scripts that alter the filegroup by name, their scripts will fail, and you'll catch the culprit. Or, a little harder:
Use Profiler or Extended Events to track anyone who does an ALTER DATABASE.
Best Answer
The
subquery
you have in your code is called a derived table. It's not a base table but a table that "lives" during the time that the query runs. Like views (which are also called viewed tables) - and in recent versions CTEs which is another, 4th way to "define" a table inside a query - they are similar to a table in many ways. You canselect
from them, you can use them infrom
or tojoin
them to other tables (base or not!).In some DBMS, (not all DBMS have implemented this the same way) these tables/views are updatable. And "updatable" means that we can also
update
,insert
into ordelete
from them.There are restrictions though and this is expected. Imagine if the
subquery
was a join of 2 (or 17 tables). What woulddelete
mean then? (from which tables should rows be deleted?) Updatable views is a very complicated matter. There's a recent (2012) book, entirely on this subject, written by Chris Date, well known expert in relational theory: View Updating and Relational Theory.When the derived table (or view) is a very simple query, like it has only one base table (possibly restricted by a
WHERE
) and noGROUP BY
, then every row of the derived table corresponds to one row in the underlying base table, so it is easy* to update, insert or delete from this.When the code inside the subquery is more complex, it depends on whether the rows of the derived table/view can be traced/resolved to rows from one of the underlying base tables.
For SQL Server, you can read more in the Updatable Views paragraph in MSDN:
CREATE VIEW
.Actually
delete
is easier, less complex thanupdate
. SQL Server only needs the primary keys or some other way to identify which rows of the base table are to be deleted. Forupdate
, there is an additional (rather obvious) restriction that we can't update a computed column. You can try to modify your query to do an update. Updating theCreatedDateTime
will probably work just fine but trying to update the computedRowNumber
column will raise an error. Andinsert
is even more complex, as we'd have to provide values for all the columns of the base table that don't have aDEFAULT
constraint.