MySQL row-based replication – Delete Fails

MySQLmysql-5.6replication

We have a Master – Master replication setup, MySQL 5.6.
Only one Master is used, the other is for backup and failover (we'll call that the slave).
The binlog_format is set to ROW, autoincrement settings are made to avoid conflicts.

The problem: Slave is halted due to Duplicate key errors.

We debugged the cause to be that bulk deletes made by cron jobs on the Master did not run (completely?!?) on slave. We are talking about tens of thousands of records that are NOT deleted from the slave. We did not find errors in the MySQL error log.

That leads to unsyncronized replica and errors when inserts are made using PK that should have been deleted on the Slave.

The tables are MyISAM.

Any idea why the bulk delete doesn't propagate properly on the replica?

Best Answer

In the absence of any obvious, logical reason why this might be happening, I'm going to shamelessly invoke the MyISAM boogey-man.

This question reminds me of one from a few years back. It's not a duplicate question, but the underlying mechanisms could be similar. The workaround I provided in that case was specifically based on the premise of duplicate "unique" values existing in a table.

I suspect you have latent, undetected defects in your MyISAM tables, where there are actually -- via an unknown (to me) mechanism -- duplicate primary (or unique) key values hiding below the surface.

When MyISAM selects or deletes a row by primary key, it won't see these duplicates, because as soon as it finds one row, it stops looking, because there "can't" be more to find... yet, if you SELECT * you'll get the duplicates... and deleting one of the rows would then unmask the other.

One way to test this that might work would be this:

mysql> CREATE TABLE test_table LIKE real_table;
mysql> INSERT INTO test_table SELECT * FROM real_table;

If this is indeed at the root of what's going on, you might get "lucky" and get a duplicate key error, which would prove the theory... because a duplicate key error should be impossible if the original table's data is intact.

You could, of course, review the binary logs using mysqlbinlog, to confirm that the deletions were logged... but I suspect you will find that the deletes did get logged correctly... but after the rows were deleted, the phantom rows were then visible, and caused the subsequent replication error.