I have a DB table with US domestic flight data which I am in the process of sorta-kinda-de-duplicating. The schema appears here, but it is probably not very important. There are no perfectly-duplicate records; rather, I have a several fields which almost form a unique key: year_, month_, dayofmonth, uniquecarrier, flightnum, origin; and under 0.1% of the records share the same values on these fields with other records.
What I want to do is isolate these "pseudo-duplicates" or key-violators, i.e. pairs of records which have the same values on these columns.
Here's what I'm currently doing:
CREATE TEMPORARY TABLE dupe_keys AS (
SELECT year_, month_, dayofmonth, uniquecarrier, flightnum, origin
FROM (
SELECT count(*) AS cnt, year_, month_, dayofmonth, uniquecarrier, flightnum, origin
FROM ontime
GROUP BY year_, month_, dayofmonth, uniquecarrier, flightnum, origin
) AS t WHERE cnt > 1
) ON COMMIT PRESERVE ROWS;
This query takes quite a lot of time (relative to an ideal possible implementation outside of a DBMS based on sorting the key columns). I then run:
CREATE TABLE dupes AS (
SELECT o.year_,o.quarter,o.month_,o.dayofmonth,o.dayofweek,o.flightdate,o.uniquecarrier,o.airlineid,o.carrier,o.tailnum,bunch_of,other_fields,go_here
FROM ontime AS o, dupe_keys AS dk
WHERE
o.year_ = dk.year_
AND o.month_ = dk.month_
AND o.dayofmonth = dk.dayofmonth
AND o.uniquecarrier = dk.uniquecarrier
AND o.flightnum = dk.flightnum
AND o.origin = dk.origin
ORDER BY o.year_ ASC, o.month_ ASC, o.dayofmonth ASC, o.uniquecarrier ASC, o.flightnum ASC, o.origin ASC
);
which obviously takes less time.
I would like to achieve the same, faster, with better queries. Alternatively, I wouldn't mind something like a pair of tables of the same size, each with one of the two dupes out of a dupe set.
Notes:
- You may assume that no combination of key-column values appears more than twice. It's either once or twice.
- I'm using MonetDB, a columnar in-memory DBMS. hence the syntax specifics.
- I hope my question is not DBMS-specific, but in case it is – answers based on the behavior of other DBMSes are still relevant, just please state which DBMS you're assuming.
- This query is run when the data set is loaded, rather than repeatedly, So pre-computation which takes even more time is not relevant.
Best Answer
In your query to create the temporary table, it's possible to do that in standard SQL without a sub-query, using a
HAVING
clause, e.g.:I don't know if that will speed up the temporary table creation, but it might be worth a try.