MySQL Performance – Best Approach to Compare Specific Columns of Two Tables in Different Databases

data integrityMySQLperformance

I'm running an application who does have a shorten version of a table in the backend database.

Both application- and backend-tables are having the same primary keys and some specific key columns.

What would be the best approach to compare both tables to find unmatched records? (The tables are containing currently 1.5 Mio rows -> going up to 5 Mio rows)

Application-Table-Example
id (pk), column1 (pK), column2, column3, column4, column5, column6, column7, column8, column9, column10 [datetime]

Backend-Table-Example
id (pk), column1 (pK), column2, column10 [datetime]

Columns I want to compare
id (pk), column1 (pK), column2, DATE(column10) [datetime -> converted to DATE]

My current approach is to create a CRC32-Checksum by the columns "id (pk), column1 (pK), column2, column10" for each row in the application database and to save them in a temporary table with the structure "id (pk), column1 (pK) and hash (crc32)". Afterwards I run an union select in the backend database and create also a crc32-checksum by the backend table and combine it with the temporary table. By grouping them after all common tables I should get for each dataset 2 rows. If I get only 1 row for a dataset, we have found an unmatch.

SELECT MIN(TableName) as TableName, {$columns} FROM (
                SELECT 'APPLICATIONTABLE' as TableName, {$columns}
                FROM APPLICATIONTABLE
                UNION ALL
                SELECT 'BACKENDTABLE' as TableName, {$pks}, CRC32(CONCAT_WS({$data['fields']})) as hash
                FROM BACKENDTABLE
            ) tmp GROUP BY {$columns} HAVING COUNT(*) = 1

Do you have any suggestions for a better approach? (My current approach works, but it's maybe not the most performant one)

Frontend- and Backend-Database are on different MySQL-Servers
Federated tables are no option to skip the temporary table part
https://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html looks like, that it doesn't support a partial comparison of explicit columns?
To some datetime fields I need to add fuzziness to obtain the same checksum (DATE(column10) ignores mismatches in second range)

Thanks a lot!

Best Answer

So, your approach is good in my opinion, with little tweaks:

First, when you create the temp table, or when you are generating the hash value, use DATE function (You mentioned is but not sure you're using it in the query):

SELECT ...., crc32(CONCAT(column1, DATE(column10))),

so you don't have to deal with 'fuzziness' [Assuming that you want the preciseness to be at day level]

To get the records that exist in one table, and not the other:

SELECT applicationtable.id, applicationtable.column1 
FROM applicationtable 
LEFT JOIN backendtable USING (id, column1) 
WHERE backendtable.id IS NULL [AND backendtable.column1 IS NULL]
UNION
SELECT backendtable.id, backendtable.column1 
FROM backendtable 
LEFT JOIN applicationtable USING (id, column1) 
WHERE applicationtable.id IS NULL [AND applicationtable.column1 IS NULL]

To get the records that have different values for the same PK:

SELECT applicationtable.id, applicationtable.column1 
FROM applicationtable 
INNER JOIN backendtable USING (id, column1) 
WHERE applicationtable.crc32_value <> backendtable.crc32_value

Be careful though:

If NULL value is allowed in column2 or in column10, CRC32 would give the same "null", even if the second field has a value, and this will not be reported by the query as "NULL <> NULL"
It is very unlikely to get collided CRC32 values, but it is not impossible. MD5 would be an option, but more expensive one.
Having [AND backendtable.column1 IS NULL] in the WHERE clause is not necessary if the field is defined to be NOT NULL

Related Solutions

Mysql – Primary key index with a DATETIME as first part of the compound key is never used

This is a bug in 5.5.x. See here

That suggests that your query should be

SELECT * 
FROM `stats`  
WHERE 
   dateDim = CAST('2014-04-03 00:00:00' as datetime)
   AND accountDim = 4
   AND execCodeDim = 9
   AND operationTypeDim = 1
   AND junkDim = 5
   AND ipCountryDim = 3

Mysql – Limit WHERE to MAX() & COUNT()

Here is your original query from the question

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM TAB_EVENT e 
LEFT JOIN TAB_MESSAGE m ON e.id=m.event_id 
WHERE ( m.`read` IS NULL OR m.`read` = 0) 
GROUP BY e.id 
ORDER BY m.datetime DESC, e.id ASC 
LIMIT 10;

Maybe try refactoring the query in such a way that in executes in this sequence

only collect necessary columns from TAB_MESSAGE
apply LIMIT 10 against the collected rows from TAB_MESSAGE
run the JOIN
apply the MAX() and COUNT() last

Here is what I am proposing

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE read IS NULL OR read = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
LEFT JOIN TAB_EVENT e 
ON e.id=m.event_id
ORDER BY m.datetime DESC, e.id ASC;

Give it a Try !!!

UPDATE 2012-02-21 17:06 EDT

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
TAB_EVENT e LEFT JOIN
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE read IS NULL OR read = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
ON e.id=m.event_id
ORDER BY m.datetime DESC, e.id ASC;

@Sebastian, I put the query back in the original join order. Please try this as well !!!

UPDATE 2012-02-21 17:11 EDT

Make sure the datetime field is indexed

ALTER TABLE TAB_MESSAGE ADD INDEX read_datetime_ndx (read,datetime);