Assuming your desired uniqueness constraint is (one, two), here is one way to do it:
DELETE FROM session.test WHERE (one, two) IN
(
SELECT one, two FROM session.test
GROUP BY one, two
HAVING COUNT(*) > 1
)
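As a rough illustration of what that statement does (using SQLite via Python here instead of the original session.test table, so names and setup are assumed), note that the inner GROUP BY ... HAVING subquery finds every (one, two) pair that occurs more than once, and the outer DELETE then removes all copies of those pairs, not just the extras:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (one INTEGER, two INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)",
                 [(1, 1), (1, 1), (2, 2), (3, 3), (3, 3), (3, 3)])

# Delete every row whose (one, two) pair appears more than once.
# Note: this removes ALL copies of a duplicated pair, not just the duplicates.
conn.execute("""
    DELETE FROM test WHERE (one, two) IN
    ( SELECT one, two FROM test
      GROUP BY one, two
      HAVING COUNT(*) > 1 )
""")
remaining = conn.execute("SELECT one, two FROM test ORDER BY one").fetchall()
print(remaining)  # only (2, 2) survives
```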
Using the ranking function ROW_NUMBER() will work. First give row numbers to all rows in both tables, then join using these row numbers, then update:
with
oldt as
( select fileNo, folder, fileType,
         row_number() over (partition by fileNo, folder
                            order by fileType) as rn
  from oldtable
),
newt as
( select fileNo, folder, fileType,
         row_number() over (partition by fileNo, folder
                            order by fileType) as rn
  from newtable
),
upd as
( select n.fileType,
         o.fileType as old_fileType
  from newt n
  join oldt o
    on  n.fileNo = o.fileNo
    and n.folder = o.folder
    and n.rn     = o.rn
)
update upd
set fileType = old_fileType ;
SQLfiddle seems to be giving an error for Oracle, so this has been tested in SQL Server only: SQLfiddle-test (but the syntax should be valid for Oracle, too).
Tested in Oracle, the above doesn't work, sadly. I think that is because statements that start with WITH can only be SELECT statements. Even when the query is rearranged (I tried several rewrites), Oracle throws various errors. The only way I managed to get it working was by adding another column to newtable with a unique constraint on it. Then the following works (nid is the added primary key column). Tested in Oracle's Live SQL site:
update
( with
  oldt as
  ( select fileNo, folder, fileType,
           row_number() over (partition by fileNo, folder
                              order by fileType) as rn
    from oldtable
  ),
  newt as
  ( select fileNo, folder, nid,
           row_number() over (partition by fileNo, folder
                              order by fileType) as rn
    from newtable
  ),
  upd as
  ( select n.nid,
           o.fileType as old_fileType
    from newt n
    join oldt o
      on  n.fileNo = o.fileNo
      and n.folder = o.folder
      and n.rn     = o.rn
  )
  select up.fileType,
         ( select upd.old_fileType
           from upd
           where upd.nid = up.nid
         ) as old_fileType
  from newtable up
) x
set fileType = old_fileType ;
Best Answer
In my experience (and as shown in many tests), NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, like in this case), but this query with EXISTS (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude. It deletes every row for which another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively, it keeps the first instance according to the physical order of tuples.) ctid is used here in the absence of a better alternative, since your table does not seem to have a PK or any other unique (set of) column(s); it is the internal tuple identifier, present in every row and necessarily unique.
Test
I ran a test case with this table matched to your question and 100k rows:
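The table definition and fill script are not included in this extract; a rough sketch of such a setup (column names taken from the description above, row counts and value distribution assumed, not the author's original script) might look like:

```sql
CREATE TABLE tags (
  tag_id      int
, question_id int      -- deliberately no PK or UNIQUE constraint
);

INSERT INTO tags (tag_id, question_id)
SELECT trunc(random() * 100)::int
     , trunc(random() * 1000)::int
FROM   generate_series(1, 100000);
```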
Indexes do not help in this case.
Results
NOT IN
The SQLfiddle times out.
I tried the same locally, but canceled it, too, after several minutes.
EXISTS
Finishes in half a second in this SQLfiddle.
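The EXISTS query itself does not appear in this extract; based on the description above (delete every row where another row with the same (tag_id, question_id) and a smaller ctid exists), a sketch with the table name tags assumed would be:

```sql
DELETE FROM tags t
WHERE  EXISTS (
   SELECT 1
   FROM   tags
   WHERE  tag_id      = t.tag_id
   AND    question_id = t.question_id
   AND    ctid        < t.ctid   -- keeps the physically first tuple
   );
```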
Alternatives
If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original, and rename the survivor table. Careful: this has implications if you have views or foreign keys (or other dependencies) defined on the original.
If you have dependencies and want to keep them, you could:
1. SELECT survivors into a temporary table.
2. TRUNCATE the original.
3. INSERT the survivors.
4. CREATE indexes and foreign keys.
Views can just stay; they have no impact on performance. More here or here.
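The steps above could be sketched like this (table and column names assumed from the rest of the answer; DISTINCT ON is one way to pick a single survivor per pair):

```sql
BEGIN;

CREATE TEMP TABLE survivors AS
SELECT DISTINCT ON (tag_id, question_id) *
FROM   tags;                -- one row per (tag_id, question_id)

TRUNCATE tags;              -- empties the table, keeps dependent objects

INSERT INTO tags
SELECT * FROM survivors;

-- CREATE INDEX / ALTER TABLE ... ADD CONSTRAINT statements would go here

COMMIT;
```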