We are running MySQL 5.1.61 on Red Hat systems with the following setup: one master and four slaves replicating from it. We recently added a new slave, and over a few days we started noticing that on the newly added slave some tables (not all) lose some records. This happens only on this slave and it is not regular; over a period of 3 weeks the issue seems to have occurred on 5-7 days.
We use statement-based replication. I am not sure why this happens on only one slave, and there are no errors in the MySQL error logs. The only difference between the old slaves and the new one is that the new slave has slightly less RAM than the others, but the new slave is not being used for anything right now.
Is there a way to troubleshoot this issue to see why it happens on only one slave? Could it be network related, or anything else? Any pointers on where to start looking?
Here is the memory info:

Old slave

             total        used        free
Mem:      24731204    24641572       89632

New slave

             total        used        free
Mem:      16431192    10112880     6318312
Best Answer
You could be suffering from what is known as data drift.
QUERIES
This can happen if there are queries that are unsafe for statement-based replication. One of the more common types is running UPDATE or DELETE with LIMIT. Using LIMIT on DML can work just fine on a Master; on a Slave, the rows selected (and, depending on certain ORDER BY choices, their order) may not be the same set being updated or deleted as the set on the Master. See the MySQL Documentation for a comprehensive description of unsafe statements that can affect MySQL Replication. Baron Schwartz once dealt with this and had to refactor his query to get around it.
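As an illustrative sketch (the table and column names here are hypothetical), this is the kind of statement that is nondeterministic under statement-based replication, along with one way to make it deterministic:

```sql
-- Hypothetical example: with no ORDER BY on a unique key,
-- which 10 matching rows are hit first depends on storage/index
-- order, which can differ between Master and Slave.
DELETE FROM orders
WHERE status = 'archived'
LIMIT 10;

-- Deterministic variant: ordering by the primary key pins down
-- exactly which 10 rows are deleted on every server.
DELETE FROM orders
WHERE status = 'archived'
ORDER BY order_id
LIMIT 10;
```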
The following hypothetical scenario illustrates one way to introduce data drift:
BINLOGS
Master
Slave
Observation
If binlogs are not flushed to disk in a timely, predictable manner, any binlog events the Slave needs could easily be bypassed. This could cause data to simply not exist on the Slave. Depending on which data is recorded or not recorded, Replication's SQL Thread could break because of missing data, or because of data that should be missing.
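The flushing behavior described above is governed by the sync_binlog variable on the Master. As a hedged sketch, it can be inspected and tightened like this:

```sql
-- Run on the Master. In MySQL 5.1 the default is 0, meaning the OS
-- decides when binary-log writes actually reach disk.
SHOW VARIABLES LIKE 'sync_binlog';

-- Setting it to 1 forces an fsync of the binary log after every
-- transaction: safest for replication, at some write-performance cost.
SET GLOBAL sync_binlog = 1;
```

To make the setting survive a restart, it would also go under [mysqld] in my.cnf.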
EPILOGUE
Not every Slave will be affected this way. A Master keeps a list of all Slave I/O Threads and transmits binlog events to the Slaves in order by their process ID on the Master. I can see later slaves being victimized first.
If sync_binlog is indeed an issue, perhaps all the Slaves have data drift and we just don't know it yet.
The only way to tell is to download a table-checksum tool and checksum everything on every Slave against the Master. You may find more data-drift problems than you think. Just run the accompanying sync scripts to correct them.
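For a quick manual spot check of a single table (the database and table names here are hypothetical), MySQL's built-in CHECKSUM TABLE can be compared between the Master and each Slave:

```sql
-- Run the same statement on the Master and on each Slave;
-- a differing checksum for a table indicates drift in that table.
CHECKSUM TABLE mydb.orders;
```

This only flags one table at a time, which is why a full checksum tool is the better option for auditing an entire server.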
CAVEAT
You suggested that network latency could be an issue. With binlogs not yet flushed by the OS, any disconnect and reconnect of MySQL Replication due to latency or dropped packets is worth looking over as well. It could also be a major contributor to data drift.
Also worth noting is which network route the new Slave uses to communicate back to the Master. If it is not the same route the older Slaves use (perhaps passing through a different switch, over a public IP, etc.), that needs to be investigated.