Efficient matching rows across two tables with huge data set

optimizationoracle

I have two tables with about 300+ columns and 20k+ rows each. Each row is characterized by a primary key which both tables have.I have to find the identical rows, rows with changes,and rows without a corresponding match in the second table(i.e there is a row with primary id=x in table 1 and no row has primary id=x in table 2). Attributing to this huge data set, can somebody suggest an "efficient" method to solve this.

Best Answer

Without making any schema change, your solution requires a table scan on both sides - that's the most efficient you'll be able to do. You simply have to LEFT JOIN A to B and compare every field.

However, if you expect to do this in the future, you can do a couple things to make life easier: you can add a calculated hash, which will almost always detect a change. Compare this field, and you can identify changes. Add an index to each table with the PK field(s) and this hash field, and you can very quickly find new and changed records:

SELECT
    A.PK,
    CASE
        WHEN B.PK IS NULL THEN 'New'
        WHEN A.Hash <> B.Hash THEN 'Changed'
        ELSE 'Unchanged' END AS Status
FROM
    A
    LEFT JOIN B ON A.PK = B.PK

Microsoft SQL offers a ROWVERSION construct which makes this type of operation simple; it looks like the equivalent in Oracle is system change numbers: https://stackoverflow.com/a/20488598/565869.

If for any reason that doesn't work for you, you could always roll your own using triggers or by modifying the sprocs which update these tables (you are wrapping your UPDATEs in sprocs, right?)