Mysql – Data drift on Row-Based-Replication

data synchronizationMySQLpercona-toolsreplication

i've realized that one of my mysql slaves has missing rows.
select count on specific table on master:
9605010
on slave:
9593831

slave is completely sync,
no seconds_behind_master,
no errors,
no slow log

version on both master and slaves:
mysql Ver 14.14 Distrib 5.6.35-80.0, for debian-linux-gnu (x86_64) using 6.3

sync_binlog=1,
row-based-replication

i'm not sure about using pt-table-checksum because, this is a prod system.
What could be cause the data drift on rbr anyway?

Best Answer

It's impossible to tell without investigation. Potentially it could be mistakes when you created the slave, it could be direct writes to the slave, bugs in replication, bugs in an app with SET binlog=0, etc.

Worse, it's very hard to investigate post-factum. You need to prepare for the data drift troubleshooting.

pt-table-checksum would be your first step. It's a mature tool, safe to use on prod, but problems are possible, too. Most likely, locking issues, deadlocks. If need to be extra paranoid, I set --chunk-time to something like 0.1 or even less.

Find differences in the data, review it (you may want to use twindb_table_compare to find which rows are missing/extra/different), and fix the mismatches (with pt-table-sync or rebuild the slave depending how big the data drift is).

And put pt-table-checksum in cron. If one run takes long time (say 1 - 2 days) I recommend to split discovery and alerting parts (i.e. the heavy pt-table-checksum populates percona.checksums and a light script checks percona.checksums and alerts if inconsistencies are found). On my systems I run pt-table-checksum continuously.

Then, when an inconsistency is discovered inspect the data, the binlog and then you'll figure out where it comes from.