Improve the duplicate-detection queries

duplicationmonetdbquery-performance

I have a DB table with US domestic flight data which I am in the process of sorta-kinda-de-duplicating. The schema appears here, but it is probably not very important. There are no perfectly-duplicate records; rather, I have a several fields which almost form a unique key: year_, month_, dayofmonth, uniquecarrier, flightnum, origin; and under 0.1% of the records share the same values on these fields with other records.

What I want to do is isolate these "pseudo-duplicates" or key-violators, i.e. pairs of records which have the same values on these columns.

Here's what I'm currently doing:

CREATE TEMPORARY TABLE dupe_keys AS (
    SELECT year_, month_, dayofmonth, uniquecarrier, flightnum, origin 
    FROM (
        SELECT count(*) AS cnt, year_, month_, dayofmonth, uniquecarrier, flightnum, origin 
        FROM ontime
        GROUP BY year_, month_, dayofmonth, uniquecarrier, flightnum, origin
    ) AS t WHERE cnt > 1
) ON COMMIT PRESERVE ROWS;

This query takes quite a lot of time (relative to an ideal possible implementation outside of a DBMS based on sorting the key columns). I then run:

CREATE TABLE dupes AS (
    SELECT  o.year_,o.quarter,o.month_,o.dayofmonth,o.dayofweek,o.flightdate,o.uniquecarrier,o.airlineid,o.carrier,o.tailnum,bunch_of,other_fields,go_here 
    FROM ontime AS o, dupe_keys AS dk
    WHERE 
            o.year_         = dk.year_ 
        AND o.month_        = dk.month_ 
        AND o.dayofmonth    = dk.dayofmonth 
        AND o.uniquecarrier = dk.uniquecarrier 
        AND o.flightnum     = dk.flightnum 
        AND o.origin        = dk.origin 
    ORDER BY o.year_ ASC, o.month_ ASC, o.dayofmonth ASC, o.uniquecarrier ASC, o.flightnum ASC, o.origin ASC
);

which obviously takes less time.

I would like to achieve the same, faster, with better queries. Alternatively, I wouldn't mind something like a pair of tables of the same size, each with one of the two dupes out of a dupe set.

Notes:

  • You may assume that no combination of key-column values appears more than twice. It's either once or twice.
  • I'm using MonetDB, a columnar in-memory DBMS. hence the syntax specifics.
  • I hope my question is not DBMS-specific, but in case it is – answers based on the behavior of other DBMSes are still relevant, just please state which DBMS you're assuming.
  • This query is run when the data set is loaded, rather than repeatedly, So pre-computation which takes even more time is not relevant.

Best Answer

In your query to create the temporary table, it's possible to do that in standard SQL without a sub-query, using a HAVING clause, e.g.:

CREATE TEMPORARY TABLE dupe_keys AS (
    SELECT count(*) AS cnt, year_, month_, dayofmonth, uniquecarrier, flightnum, origin 
    FROM ontime
    GROUP BY year_, month_, dayofmonth, uniquecarrier, flightnum, origin
    HAVING cnt > 1
);

I don't know if that will speed up the temporary table creation, but it might be worth a try.