In my experience (and as shown in many tests) `NOT IN` as demonstrated by @gsiems is rather slow and scales terribly. The inverse `IN` is typically faster (where you can reformulate that way, as in this case), but this query with `EXISTS` (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude:
DELETE FROM questions_tags q
WHERE EXISTS (
SELECT FROM questions_tags q1
WHERE q1.ctid < q.ctid
AND q1.question_id = q.question_id
AND q1.tag_id = q.tag_id
);
This deletes every row for which another row with the same `(tag_id, question_id)` and a smaller `ctid` exists - effectively keeping the first instance according to the physical order of tuples. I am using `ctid` in the absence of a better alternative, since your table does not seem to have a PK or any other unique (set of) column(s). `ctid` is the internal tuple identifier present in every row and necessarily unique.
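For illustration, you can inspect `ctid` directly; a minimal sketch against the `questions_tags` table used below:

```sql
-- ctid identifies the physical location of each row version: (block, offset).
SELECT ctid, question_id, tag_id
FROM   questions_tags
LIMIT  5;

-- Caution: ctid can change after UPDATE or VACUUM FULL, so only rely on it
-- within a single statement, like the DELETE above.
```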
Test
I ran a test case with this table matched to your question and 100k rows:
CREATE TABLE questions_tags(
question_id integer NOT NULL
, tag_id integer NOT NULL
);
INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM generate_series(1, 100000);
ANALYZE questions_tags;
Indexes do not help in this case.
Results
NOT IN
The SQLfiddle times out. I tried the same locally, but canceled it, too, after several minutes.
EXISTS
Finishes in half a second in this SQLfiddle.
Alternatives
If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original, and rename the survivors' table. Careful: this has implications if you have views or foreign keys (or other dependencies) defined on the original.
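A minimal sketch of that approach, assuming the table and columns from the question and no dependent objects:

```sql
BEGIN;

-- Keep exactly one row per (question_id, tag_id) group.
CREATE TABLE questions_tags_new AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

DROP TABLE questions_tags;   -- fails if dependent objects still exist
ALTER TABLE questions_tags_new RENAME TO questions_tags;

COMMIT;
```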
If you have dependencies and want to keep them, you could:
- Drop all foreign keys and indexes - for performance.
- `SELECT` survivors to a temporary table.
- `TRUNCATE` the original.
- Re-`INSERT` survivors.
- Re-`CREATE` indexes and foreign keys.
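The steps above can be sketched like this, again assuming the table from the question (drop and re-create your actual constraints and indexes around it):

```sql
BEGIN;

-- 1. Drop FK constraints and indexes here (names depend on your schema).

-- 2. SELECT survivors to a temporary table.
CREATE TEMP TABLE tmp AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- 3. TRUNCATE the original (much faster than DELETE for the whole table).
TRUNCATE questions_tags;

-- 4. Re-INSERT survivors.
INSERT INTO questions_tags (question_id, tag_id)
SELECT question_id, tag_id
FROM   tmp;

-- 5. Re-create indexes and FK constraints here.

COMMIT;
```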
Views can just stay; they have no impact on performance.
Your question is a bit light on definitions. Assuming:
- You define the "least amount of information" by how many of the relevant columns are NULL.
- The primary key is `adr_id`.
- Duplicates are marked with a column `dupe_id` to indicate groups of duplicates.
Since it's also vague what to do exactly, I create a separate table with the dupe ranking:
CREATE TABLE adr_dupe_rank AS
SELECT adr_id, dupe_id
, rank() OVER (PARTITION BY dupe_id
ORDER BY (nr_of_beds IS NULL)::int
+ (nr_of_baths IS NULL)::int
-- + (...)::int -- more terms
) AS rnk
FROM address;
`false` translates to `0`, `true` translates to `1`. So rows with the fewest empty columns are ranked first. Master rows end up with `rnk = 1`; dupes get higher `rnk` numbers.
The window function `rank()` assigns 1 to all rows sharing the lowest score per `dupe_id`. Add enough columns or other terms to break ties and get one winner per `dupe_id`, or deal with multiple winners separately.
You can then do as you please. To just delete dupes:
DELETE FROM address a
USING adr_dupe_rank d
WHERE a.adr_id = d.adr_id
AND d.rnk > 1;
Alternatively, you can use the above query as a derived table in the `DELETE` directly.
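That variant might look like this - same logic, without persisting the ranking table (a sketch, using the same hypothetical columns `nr_of_beds` and `nr_of_baths` from above):

```sql
DELETE FROM address a
USING (
   SELECT adr_id
        , rank() OVER (PARTITION BY dupe_id
                       ORDER BY (nr_of_beds  IS NULL)::int
                              + (nr_of_baths IS NULL)::int
                       -- + more terms to break ties
                      ) AS rnk
   FROM   address
   ) d
WHERE a.adr_id = d.adr_id
AND   d.rnk > 1;   -- delete everything but the best-ranked row per group
```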
Best Answer
Assuming your desired uniqueness constraint is `(one, two)`, here is one way to do it:

DELETE FROM session.test
WHERE (one, two) IN (
   SELECT one, two
   FROM   session.test
   GROUP  BY one, two
   HAVING count(*) > 1
   );

Note that this deletes every row of each duplicated group, keeping none. To keep one survivor per group, additionally exclude one row per group, e.g. with a `ctid` comparison like in the `EXISTS` query above.