MySQL replication stopped for over 10 days, will replication work correctly when expire_logs_days=10

MySQLreplication

I'm foggy on exactly how replication works. Here's my scenario.

Today (April 6th) I realized that our replication slave had an error caused by a failed "CREATE USER" command on March 20. Since then, no replication has happened.

This leaves a six day or so gap where expire_logs_days would have expired logs on the master, and the slave was not replicating.

I was able to skip the error and get replication going again. It all caught up according to SHOW SLAVE STATUS\G.

How can I check that things are replicated correctly, beyond "spot checks?" For example, the number of posts on master/slave are identical, that's encouraging.

But when I run, for example, this query on each database, I do not get the same answers:

select count(*) from table where datestamp >= '2016-03-24 00:00:00' AND datestamp <= '2016-03-24 23:59:59';

Should I be concerned that data is missing/incorrect? Is there a reason why the timestamps would be off between databases?

I'd prefer not to have to re-set up the replication, but am prepared to do so.

Thank you.

Best Answer

The reason you were able to issue a single skip slave statement and it caught up, is because the SQL thread stopped, but the IO thread continues.

As long as you see:

Slave_IO_Running: Yes

when you run SHOW SLAVE STATUS, then the slave is able to continually pull the binary logs from the master and store them into the relay logs.

Regardless of expire_logs_days setting on the master and if the binlogs disappear, as long as they have been pulled and stored into the relay logs on the slave, they can be replayed.

One caveat being that on very busy servers, replaying over 10 days worth of relay logs would be much longer than doing a fresh restore.


Regarding the differences in your counts, you could be having other issues unrelated to this replication event. It is tough to know precisely without access to binlogs, schema and my.cnf.

Some rough debugging steps:

  • Do count of days prior to the replication event differ at all? Do other days after differ?
  • Use a tool to get differences in slave and master. I recommend pt-table-checksum, but be aware of the Limitation of servers running in RBR/MIXED binlog_format:

    pt-table-checksum requires statement-based replication, and it sets binlog_format=STATEMENT on the master, but due to a MySQL limitation replicas do not honor this change. Therefore, checksums will not replicate past any replicas using row-based replication that are masters for further replicas.

  • Debug the root cause of any differences found by the checksum process. This could be many reasons. A couple that come to mind is using statement-based replication in some situations, or different timezone settings on master and slave.

  • To resolve immediate differences, you can use pt-table-sync or restore a fresh copy of the master.