In my experience (and as shown in many tests) `NOT IN` as demonstrated by @gsiems is rather slow and scales terribly. The inverse `IN` is typically faster (where you can reformulate that way, as in this case), but this query with `EXISTS` (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude:
DELETE FROM questions_tags q
WHERE EXISTS (
SELECT FROM questions_tags q1
WHERE q1.ctid < q.ctid
AND q1.question_id = q.question_id
AND q1.tag_id = q.tag_id
);
This deletes every row for which another row with the same `(tag_id, question_id)` and a smaller `ctid` exists - effectively keeping the first instance according to the physical order of tuples. I am using `ctid` in the absence of a better alternative, since your table does not seem to have a PK or any other unique (set of) column(s). `ctid` is the internal tuple identifier present in every row and necessarily unique.
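For illustration, you can inspect `ctid` directly; a minimal sketch against the `questions_tags` table used below:

```sql
-- ctid identifies the physical location of each row version: (block, offset).
SELECT ctid, question_id, tag_id
FROM   questions_tags
LIMIT  5;

-- Caution: ctid can change after UPDATE or VACUUM FULL, so only rely on it
-- within a single statement, like the DELETE above.
```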
Test
I ran a test case with this table matched to your question and 100k rows:
CREATE TABLE questions_tags(
question_id integer NOT NULL
, tag_id integer NOT NULL
);
INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM generate_series(1, 100000);
ANALYZE questions_tags;
Indexes do not help in this case.
Results
NOT IN
The SQLfiddle times out. I tried the same locally, but canceled it, too, after several minutes.
EXISTS
Finishes in half a second in this SQLfiddle.
Alternatives
If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original, and rename the survivors' table. Careful: this has implications if you have views or foreign keys (or other dependencies) defined on the original.
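A minimal sketch of that approach, assuming the table and columns from the question and no dependent objects:

```sql
BEGIN;

-- Keep exactly one row per (question_id, tag_id) group.
CREATE TABLE questions_tags_new AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

DROP TABLE questions_tags;   -- fails if dependent objects still exist
ALTER TABLE questions_tags_new RENAME TO questions_tags;

COMMIT;
```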
If you have dependencies and want to keep them, you could:
- Drop all foreign keys and indexes - for performance.
- `SELECT` survivors to a temporary table.
- `TRUNCATE` the original.
- Re-`INSERT` survivors.
- Re-`CREATE` indexes and foreign keys.
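The steps above can be sketched like this, again assuming the table from the question (drop and re-create your actual constraints and indexes around it):

```sql
BEGIN;

-- 1. Drop FK constraints and indexes here (names depend on your schema).

-- 2. SELECT survivors to a temporary table.
CREATE TEMP TABLE tmp AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- 3. TRUNCATE the original (much faster than DELETE for the whole table).
TRUNCATE questions_tags;

-- 4. Re-INSERT survivors.
INSERT INTO questions_tags (question_id, tag_id)
SELECT question_id, tag_id
FROM   tmp;

-- 5. Re-create indexes and FK constraints here.

COMMIT;
```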
Views can just stay; they have no impact on performance.
Your question is a bit light on definitions. Assuming:
- You define the "least amount of information" by how many of the relevant columns are NULL.
- The primary key is `adr_id`.
- Duplicates are marked with a column `dupe_id` to indicate groups of duplicates.
Since it's also vague what to do exactly, I create a separate table with the dupe ranking:
CREATE TABLE adr_dupe_rank AS
SELECT adr_id, dupe_id
, rank() OVER (PARTITION BY dupe_id
ORDER BY (nr_of_beds IS NULL)::int
+ (nr_of_baths IS NULL)::int
-- + (...)::int -- more terms
) AS rnk
FROM address;
`false` translates to `0`, `true` translates to `1`. So rows with the fewest empty columns are ranked first. Master rows end up with `rnk = 1`; dupes get higher `rnk` numbers.
The window function `rank()` assigns 1 to all rows sharing the lowest score per `dupe_id`. Add enough columns or other terms to break ties and get one winner per `dupe_id`, or deal with multiple winners separately.
You can then do as you please. To just delete dupes:
DELETE FROM address a
USING adr_dupe_rank d
WHERE a.adr_id = d.adr_id
AND d.rnk > 1;
Alternatively, you can use the above query as a derived table in the `DELETE` directly.
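That variant might look like this - same logic, without persisting the ranking table (a sketch, using the same hypothetical columns `nr_of_beds` and `nr_of_baths` from above):

```sql
DELETE FROM address a
USING (
   SELECT adr_id
        , rank() OVER (PARTITION BY dupe_id
                       ORDER BY (nr_of_beds  IS NULL)::int
                              + (nr_of_baths IS NULL)::int
                       -- + more terms to break ties
                      ) AS rnk
   FROM   address
   ) d
WHERE a.adr_id = d.adr_id
AND   d.rnk > 1;   -- delete everything but the best-ranked row per group
```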
Best Answer
Assuming your desired uniqueness constraint is `(one, two)`, here is one way to do it:

DELETE FROM session.test
WHERE (one, two) IN (
   SELECT one, two
   FROM   session.test
   GROUP  BY one, two
   HAVING count(*) > 1
   );

Note that this deletes every row of each duplicated group, keeping none. To keep one survivor per group, additionally exclude one row per group, e.g. with a `ctid` comparison like in the `EXISTS` query above.