MySQL replication goes out of sync for some tables

MySQL replication

We are running MySQL 5.1.61 on Red Hat systems and have the following setup:

One master and four slaves replicating from the master. We recently added a new slave, and over a few days we started noticing that some tables (not all) on the newly added slave lose some records. This happens only on this slave and it is not regular; over a period of 3 weeks the issue seems to have occurred on 5-7 days.

We use statement-based replication. I am not sure why this happens on only one slave, and there are no errors in the MySQL error logs. The only difference between the old slaves and the new slave is that the new slave has slightly less RAM than the others, but the new slave is not being used for anything right now.

Is there a way to troubleshoot this issue and see why it happens on only one slave? Could it be network related, or something else? Any pointers on where to start looking?

Here is the memory info (from free, in KB).

Old slave

            total       used       free
 Mem:      24731204   24641572      89632

New slave

            total       used       free
 Mem:      16431192   10112880    6318312

Best Answer

You could be suffering from what is known as data drift.

QUERIES

This can happen if there are queries that are unsafe for replication.

One of the more common cases is running UPDATE or DELETE with LIMIT. Using LIMIT on DML can work just fine on a Master, but on a Slave the rows selected (depending on the ORDER BY choices) may not be the same set being updated or deleted as the set on the Master. See the MySQL documentation for a comprehensive description of unsafe statements that can affect MySQL replication.
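
To make that concrete, here is a hypothetical sketch (the table and column names are invented) of the kind of statement that drifts under statement-based replication:

    -- Unsafe under statement-based replication: created_at is not unique,
    -- so the Master and a Slave may pick (and delete) different rows.
    DELETE FROM session_log
    ORDER BY created_at
    LIMIT 100;

    -- More deterministic rewrite: add a unique column as a tie-breaker so the
    -- row set is repeatable (the server may still warn that LIMIT is unsafe).
    DELETE FROM session_log
    ORDER BY created_at, id
    LIMIT 100;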

Baron Schwartz once dealt with this and had to refactor his query to get around it.

The following hypothetical scenario illustrates one way to introduce data drift:

BINLOGS

Master

  • 20 DB Connections writing changes (INSERTs, UPDATEs, DELETEs)
  • The SQL coming from those 20 DB Connections has to be serialized into the Master's binary log
  • That serialization may write the queries to the binlogs in an order different from when each DB Connection executed its change
  • sync_binlog set to 0 (the default), which leaves the responsibility of flushing binlogs to disk in the hands of the OS (a quick way to check this is sketched right after this list)
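
A quick sketch of checking and tightening that setting on the Master (SET GLOBAL requires the SUPER privilege, and the change should also go into my.cnf to survive a restart):

    -- 0 means the OS decides when binlog writes actually reach disk
    SHOW GLOBAL VARIABLES LIKE 'sync_binlog';

    -- 1 syncs the binary log to disk after every transaction: the safest
    -- setting, at a real write-performance cost on 5.1-era hardware
    SET GLOBAL sync_binlog = 1;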

Slave

  • The I/O Thread reads binlog events in the order the Master wrote them
  • The SQL Thread executes binlog events in the order the Master wrote them (the sketch after this list shows how to watch both threads)
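
On each Slave, both threads can be watched with SHOW SLAVE STATUS; the columns below are the ones that matter for this scenario:

    SHOW SLAVE STATUS\G
    -- Read_Master_Log_Pos : how far the I/O thread has read from the Master
    -- Exec_Master_Log_Pos : how far the SQL thread has executed
    -- Last_Error          : empty in your case, which matches the clean error logs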

Observation

If binlogs are not flushed to disk in a timely, predictable manner, binlog events the Slave needs could easily be missed. This could cause data simply not to exist on the Slave. Depending on which data is or is not recorded, replication's SQL Thread could break because of missing data, or because of data that should be missing.

EPILOGUE

Not every Slave is necessarily affected this way. A Master keeps a list of all Slave I/O Threads and transmits binlog events to each Slave in order of its ProcessID on the Master, so I can see later-added slaves being victimized first.
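
You can see those threads on the Master itself: each connected Slave holds one Binlog Dump connection open (hosts below are placeholders, output trimmed):

    SHOW PROCESSLIST;
    -- Id | User | Host          | Command     | State
    -- 31 | repl | slave1:52100  | Binlog Dump | Has sent all binlog to slave; waiting for binlog to be updated
    -- 47 | repl | slave4:49812  | Binlog Dump | Has sent all binlog to slave; waiting for binlog to be updated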

If sync_binlog is indeed an issue, perhaps all Slaves have data drift and we just don't know of it yet.

The only way to tell is to download one of the following:

  • pt-table-checksum from the Percona Toolkit

or

  • mk-table-checksum from Maatkit (the older, now-deprecated predecessor of the Percona Toolkit)

and checksum everything on every Slave against the Master. You may find more data-drift problems than you think. Just run the companion sync tool (pt-table-sync or mk-table-sync) to correct them.
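
Once the checksum tool has run against the Master, the drift report on each Slave is just a query away. A sketch, assuming pt-table-checksum's default --replicate table percona.checksums:

    -- Run on each Slave: lists every table whose checksums or row counts
    -- disagree with the Master
    SELECT db, tbl, COUNT(*) AS bad_chunks
    FROM   percona.checksums
    WHERE  master_cnt <> this_cnt
       OR  master_crc <> this_crc
       OR  ISNULL(master_crc) <> ISNULL(this_crc)
    GROUP  BY db, tbl;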

CAVEAT

You suggested that network latency could be at issue. With binlogs not yet flushed by the OS, any disconnect and reconnect of MySQL replication due to latency or dropped packets is worth looking into as well; it could also be a major contributor to data drift.
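
When chasing the network angle, two things on the new Slave are cheap to check (both exist in MySQL 5.1):

    -- How long the I/O thread waits for new data before it declares the
    -- connection dead and attempts a reconnect (default 3600 seconds)
    SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';

    -- Connect_Retry shows the reconnect interval; the Slave's error log
    -- records every disconnect/reconnect cycle with timestamps
    SHOW SLAVE STATUS\G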

Also worth noting is the network route the new Slave uses to communicate back to the Master. If it is not the same route the older Slaves use (perhaps passing through a different switch, over a public IP, etc.), that needs to be investigated.