MySQL replication goes out of sync for some tables

MySQL replication

We are running MySQL 5.1.61 on Red Hat systems and have the following setup:

One master and four slaves replicating from the master. We recently added a new slave, and over a few days we started noticing that some tables (not all) on the newly added slave lose some records. This happens only on this slave and it is not regular; over a period of 3 weeks the issue seems to have occurred on 5-7 days.

We use statement-based replication. I am not sure why this happens on only one slave, and there are no errors in the MySQL error logs. The only difference between the old slaves and the new slave is that the new slave has slightly less RAM than the others, but the new slave is not being used for anything right now.

Is there a way to troubleshoot this issue and see why it happens on only one slave? Could it be network related, or something else? Any pointers on where to start looking?

Here is the memory info (from free, in KB).

Old slave

            total       used       free
 Mem:      24731204   24641572      89632

New slave

            total       used       free
 Mem:      16431192   10112880    6318312

Best Answer

You could be suffering from what is known as data drift.

QUERIES

This can happen if there are queries that are unsafe for replication.

One of the more common cases is running UPDATE or DELETE with LIMIT. Using LIMIT on DML can work just fine on a Master, but on a Slave the rows selected (depending on the ORDER BY choices) may not be the same set being updated or deleted as the set on the Master. See the MySQL documentation for a comprehensive description of unsafe statements that can affect MySQL replication.
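
To make that concrete, here is a hypothetical sketch (the table and column names are invented) of the kind of statement that drifts under statement-based replication:

    -- Unsafe under statement-based replication: created_at is not unique,
    -- so the Master and a Slave may pick (and delete) different rows.
    DELETE FROM session_log
    ORDER BY created_at
    LIMIT 100;

    -- More deterministic rewrite: add a unique column as a tie-breaker so the
    -- row set is repeatable (the server may still warn that LIMIT is unsafe).
    DELETE FROM session_log
    ORDER BY created_at, id
    LIMIT 100;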

Baron Schwartz once dealt with this and had to refactor his query to get around it.

The following hypothetical scenario illustrates one way to introduce data drift:

BINLOGS

Master

  • 20 DB Connections writing changes (INSERTs, UPDATEs, DELETEs)
  • The SQL coming from those 20 DB Connections has to be serialized into the Master's binary log
  • That serialization may write the queries to the binlogs in an order different from when each DB Connection executed its change
  • sync_binlog set to 0 (the default), which leaves the responsibility of flushing binlogs to disk in the hands of the OS (a quick way to check this is sketched right after this list)
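
A quick sketch of checking and tightening that setting on the Master (SET GLOBAL requires the SUPER privilege, and the change should also go into my.cnf to survive a restart):

    -- 0 means the OS decides when binlog writes actually reach disk
    SHOW GLOBAL VARIABLES LIKE 'sync_binlog';

    -- 1 syncs the binary log to disk after every transaction: the safest
    -- setting, at a real write-performance cost on 5.1-era hardware
    SET GLOBAL sync_binlog = 1;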

Slave

  • The I/O Thread reads binlog events in the order the Master wrote them
  • The SQL Thread executes binlog events in the order the Master wrote them (the sketch after this list shows how to watch both threads)
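
On each Slave, both threads can be watched with SHOW SLAVE STATUS; the columns below are the ones that matter for this scenario:

    SHOW SLAVE STATUS\G
    -- Read_Master_Log_Pos : how far the I/O thread has read from the Master
    -- Exec_Master_Log_Pos : how far the SQL thread has executed
    -- Last_Error          : empty in your case, which matches the clean error logs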

Observation

If binlogs are not flushed to disk in a timely, predictable manner, binlog events the Slave needs could easily be missed. This could cause data simply not to exist on the Slave. Depending on which data is or is not recorded, replication's SQL Thread could break because of missing data, or because of data that should be missing.

EPILOGUE

Not every Slave is necessarily affected this way. A Master keeps a list of all Slave I/O Threads and transmits binlog events to each Slave in order of its ProcessID on the Master, so I can see later-added slaves being victimized first.
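
You can see those threads on the Master itself: each connected Slave holds one Binlog Dump connection open (hosts below are placeholders, output trimmed):

    SHOW PROCESSLIST;
    -- Id | User | Host          | Command     | State
    -- 31 | repl | slave1:52100  | Binlog Dump | Has sent all binlog to slave; waiting for binlog to be updated
    -- 47 | repl | slave4:49812  | Binlog Dump | Has sent all binlog to slave; waiting for binlog to be updated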

If sync_binlog is indeed an issue, perhaps all Slaves have data drift and we just don't know of it yet.

The only way to tell is to download one of the following:

  • pt-table-checksum from the Percona Toolkit

or

  • mk-table-checksum from Maatkit (the older, now-deprecated predecessor of the Percona Toolkit)

and checksum everything on every Slave against the Master. You may find more data-drift problems than you think. Just run the companion sync tool (pt-table-sync or mk-table-sync) to correct them.
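
Once the checksum tool has run against the Master, the drift report on each Slave is just a query away. A sketch, assuming pt-table-checksum's default --replicate table percona.checksums:

    -- Run on each Slave: lists every table whose checksums or row counts
    -- disagree with the Master
    SELECT db, tbl, COUNT(*) AS bad_chunks
    FROM   percona.checksums
    WHERE  master_cnt <> this_cnt
       OR  master_crc <> this_crc
       OR  ISNULL(master_crc) <> ISNULL(this_crc)
    GROUP  BY db, tbl;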

CAVEAT

You suggested that network latency could be at issue. With binlogs not yet flushed by the OS, any disconnect and reconnect of MySQL replication due to latency or dropped packets is worth looking into as well; it could also be a major contributor to data drift.
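
When chasing the network angle, two things on the new Slave are cheap to check (both exist in MySQL 5.1):

    -- How long the I/O thread waits for new data before it declares the
    -- connection dead and attempts a reconnect (default 3600 seconds)
    SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';

    -- Connect_Retry shows the reconnect interval; the Slave's error log
    -- records every disconnect/reconnect cycle with timestamps
    SHOW SLAVE STATUS\G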

Also worth noting is the network route the new Slave uses to communicate back to the Master. If it is not the same route the older Slaves use (perhaps passing through a different switch, over a public IP, etc.), that needs to be investigated.