Improve the duplicate-detection queries

I have a DB table with US domestic flight data which I am in the process of sorta-kinda-de-duplicating. The schema appears here, but it is probably not very important. There are no perfectly-duplicate records; rather, I have a several fields which almost form a unique key: year_, month_, dayofmonth, uniquecarrier, flightnum, origin; and under 0.1% of the records share the same values on these fields with other records.

What I want to do is isolate these "pseudo-duplicates" or key-violators, i.e. pairs of records which have the same values on these columns.

Here's what I'm currently doing:

CREATE TEMPORARY TABLE dupe_keys AS (
    SELECT year_, month_, dayofmonth, uniquecarrier, flightnum, origin 
    FROM (
        SELECT count(*) AS cnt, year_, month_, dayofmonth, uniquecarrier, flightnum, origin 
        FROM ontime
        GROUP BY year_, month_, dayofmonth, uniquecarrier, flightnum, origin
    ) AS t WHERE cnt > 1
) ON COMMIT PRESERVE ROWS;

This query takes quite a lot of time (relative to an ideal possible implementation outside of a DBMS based on sorting the key columns). I then run:

CREATE TABLE dupes AS (
    SELECT  o.year_,o.quarter,o.month_,o.dayofmonth,o.dayofweek,o.flightdate,o.uniquecarrier,o.airlineid,o.carrier,o.tailnum,bunch_of,other_fields,go_here 
    FROM ontime AS o, dupe_keys AS dk
    WHERE 
            o.year_         = dk.year_ 
        AND o.month_        = dk.month_ 
        AND o.dayofmonth    = dk.dayofmonth 
        AND o.uniquecarrier = dk.uniquecarrier 
        AND o.flightnum     = dk.flightnum 
        AND o.origin        = dk.origin 
    ORDER BY o.year_ ASC, o.month_ ASC, o.dayofmonth ASC, o.uniquecarrier ASC, o.flightnum ASC, o.origin ASC
);

which obviously takes less time.

I would like to achieve the same, faster, with better queries. Alternatively, I wouldn't mind something like a pair of tables of the same size, each with one of the two dupes out of a dupe set.

Notes:

You may assume that no combination of key-column values appears more than twice. It's either once or twice.
I'm using MonetDB, a columnar in-memory DBMS. hence the syntax specifics.
I hope my question is not DBMS-specific, but in case it is – answers based on the behavior of other DBMSes are still relevant, just please state which DBMS you're assuming.
This query is run when the data set is loaded, rather than repeatedly, So pre-computation which takes even more time is not relevant.

CREATE TEMPORARY TABLE dupe_keys AS ( SELECT count(*) AS cnt, year_, month_, dayofmonth, uniquecarrier, flightnum, origin FROM ontime GROUP BY year_, month_, dayofmonth, uniquecarrier, flightnum, origin HAVING cnt > 1 );

Improve the duplicate-detection queries

Best Answer

Related Question

Best Answer

Related Solutions

SQL Server Query Performance – Optimizing Group By with Many Columns

Sql-server – Query looks optimal but logical reads is huge

Related Question