MySQL Performance – Best Approach to Compare Specific Columns of Two Tables in Different Databases

data integrityMySQLperformance

I'm running an application who does have a shorten version of a table in the backend database.

Both application- and backend-tables are having the same primary keys and some specific key columns.

What would be the best approach to compare both tables to find unmatched records? (The tables are containing currently 1.5 Mio rows -> going up to 5 Mio rows)

Application-Table-Example

id (pk), column1 (pK), column2, column3, column4, column5, column6, column7, column8, column9, column10 [datetime]

Backend-Table-Example

id (pk), column1 (pK), column2, column10 [datetime]

Columns I want to compare

id (pk), column1 (pK), column2, DATE(column10) [datetime -> converted to DATE]

My current approach is to create a CRC32-Checksum by the columns "id (pk), column1 (pK), column2, column10" for each row in the application database and to save them in a temporary table with the structure "id (pk), column1 (pK) and hash (crc32)". Afterwards I run an union select in the backend database and create also a crc32-checksum by the backend table and combine it with the temporary table. By grouping them after all common tables I should get for each dataset 2 rows. If I get only 1 row for a dataset, we have found an unmatch.

SELECT MIN(TableName) as TableName, {$columns} FROM (
                SELECT 'APPLICATIONTABLE' as TableName, {$columns}
                FROM APPLICATIONTABLE
                UNION ALL
                SELECT 'BACKENDTABLE' as TableName, {$pks}, CRC32(CONCAT_WS({$data['fields']})) as hash
                FROM BACKENDTABLE
            ) tmp GROUP BY {$columns} HAVING COUNT(*) = 1

Do you have any suggestions for a better approach? (My current approach works, but it's maybe not the most performant one)

  • Frontend- and Backend-Database are on different MySQL-Servers
  • Federated tables are no option to skip the temporary table part
  • https://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html looks like, that it doesn't support a partial comparison of explicit columns?
  • To some datetime fields I need to add fuzziness to obtain the same checksum (DATE(column10) ignores mismatches in second range)

Thanks a lot!

Best Answer

So, your approach is good in my opinion, with little tweaks:

First, when you create the temp table, or when you are generating the hash value, use DATE function (You mentioned is but not sure you're using it in the query):

SELECT ...., crc32(CONCAT(column1, DATE(column10))),

so you don't have to deal with 'fuzziness' [Assuming that you want the preciseness to be at day level]

To get the records that exist in one table, and not the other:

SELECT applicationtable.id, applicationtable.column1 
FROM applicationtable 
LEFT JOIN backendtable USING (id, column1) 
WHERE backendtable.id IS NULL [AND backendtable.column1 IS NULL]
UNION
SELECT backendtable.id, backendtable.column1 
FROM backendtable 
LEFT JOIN applicationtable USING (id, column1) 
WHERE applicationtable.id IS NULL [AND applicationtable.column1 IS NULL]

To get the records that have different values for the same PK:

SELECT applicationtable.id, applicationtable.column1 
FROM applicationtable 
INNER JOIN backendtable USING (id, column1) 
WHERE applicationtable.crc32_value <> backendtable.crc32_value

Be careful though:

  • If NULL value is allowed in column2 or in column10, CRC32 would give the same "null", even if the second field has a value, and this will not be reported by the query as "NULL <> NULL"
  • It is very unlikely to get collided CRC32 values, but it is not impossible. MD5 would be an option, but more expensive one.
  • Having [AND backendtable.column1 IS NULL] in the WHERE clause is not necessary if the field is defined to be NOT NULL