MySQL replication stopped for over 10 days, will replication work correctly when expire_logs_days=10

MySQLreplication

I'm foggy on exactly how replication works. Here's my scenario.

Today (April 6th) I realized that our replication slave had an error caused by a failed "CREATE USER" command on March 20. Since then, no replication has happened.

This leaves a six day or so gap where expire_logs_days would have expired logs on the master, and the slave was not replicating.

I was able to skip the error and get replication going again. It all caught up according to SHOW SLAVE STATUS\G.

How can I check that things are replicated correctly, beyond "spot checks?" For example, the number of posts on master/slave are identical, that's encouraging.

But when I run, for example, this query on each database, I do not get the same answers:

select count(*) from table where datestamp >= '2016-03-24 00:00:00' AND datestamp <= '2016-03-24 23:59:59';

Should I be concerned that data is missing/incorrect? Is there a reason why the timestamps would be off between databases?

I'd prefer not to have to re-set up the replication, but am prepared to do so.

Thank you.

Best Answer

The reason you were able to issue a single skip slave statement and it caught up, is because the SQL thread stopped, but the IO thread continues.

As long as you see:

Slave_IO_Running: Yes

when you run SHOW SLAVE STATUS, then the slave is able to continually pull the binary logs from the master and store them into the relay logs.

Regardless of expire_logs_days setting on the master and if the binlogs disappear, as long as they have been pulled and stored into the relay logs on the slave, they can be replayed.

One caveat being that on very busy servers, replaying over 10 days worth of relay logs would be much longer than doing a fresh restore.

Regarding the differences in your counts, you could be having other issues unrelated to this replication event. It is tough to know precisely without access to binlogs, schema and my.cnf.

Some rough debugging steps:

Do count of days prior to the replication event differ at all? Do other days after differ?
Use a tool to get differences in slave and master. I recommend pt-table-checksum, but be aware of the Limitation of servers running in RBR/MIXED binlog_format:

pt-table-checksum requires statement-based replication, and it sets binlog_format=STATEMENT on the master, but due to a MySQL limitation replicas do not honor this change. Therefore, checksums will not replicate past any replicas using row-based replication that are masters for further replicas.
Debug the root cause of any differences found by the checksum process. This could be many reasons. A couple that come to mind is using statement-based replication in some situations, or different timezone settings on master and slave.
To resolve immediate differences, you can use pt-table-sync or restore a fresh copy of the master.

Related Solutions

Thesql replication deadlock slave stopped

This query fails on slave: 'insert into archieved(eid,at,rtime,scode) select id,at,avg(rtime),scode from tstatus where date(at)="2011-10-29" group by id'

You report a strange Error_code: 0, but try to run this query on slave? When it run succesfully then run as root:

SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
SLAVE START;

Mysql – Should I disable triggers, timestamps in a slave in thesql

Do not disable the triggers on your slaves, and do not configure your slaves to skip errors.

When replication is set up correctly with consistent data sets and identical configurations, including tables and their data, views, triggers, stored functions, and stored procedures, it just works. If it isn't "just working," then it was already not consistent in one of these ways. The duplicate key errors are a red flag that your data is not identical, and you don't want to suppress this.

If you aren't already, you should be using binlog_format = MIXED unless you have a specific reason not to. The default value prior to MySQL 5.6 was STATEMENT but this was largely for legacy reasons.

Using mixed logging allows the optimizer on the master to choose the best way to log each change to the master's data -- either by:

logging the literal and exact query your client sent to the master, so that the slaves can execute that same query, thereby modifying their data (statement based logging), or
logging the specific rows that were changed on the master so that the slave can change those same rows in the affected table without regard to the actual query that caused the changes (row based-logging).

If a statement is determined to be "unsafe" for statement-based logging (meaning that it could potentially cause the data to diverge from its initial identical state), it will be automatically logged as row-based, if you're using MIXED mode. Examples of this would be statements involving calls to non-deterministic functions like UUID() which would return different values on master and slave. Row-based logging handles this without a problem, since it passes the literal values across.

You might think NOW() might also be such a function, and that a similar limitation might apply to automatic timestamp columns, if the server clocks are not in sync, but that's not the case -- because with each binary (replication) log entry, the master logs what its current timestamp was at the time the query was executed -- allowing the slave to have a "pseudo-system clock" to actually fudge-in the master's clock's value at the time that statement was executed when needed. This log entry, incidentally, is how the slave calculates "seconds behind master," by comparing its system clock to the logged timestamps. (The exception to this is when the slave has executed everything in its relay logs, it always shows "0" seconds behind, because it doesn't know whether the former logs were delayed by clock skew or by an actual replication lag -- so if it has run out of things to execute, it is no longer technically "behind").

If a query is logged using row-based logging, then any data changed by triggers are also logged as row-based log entries, and the triggers on the slave will automatically not fire.

On the other hand, if the same query is logged using statement-based logging, then replication depends on the identical triggers being present on the slave, because the master will not log what the trigger did -- it only logs the query issued, and depends on the slave triggers firing exactly as they did on the master, assuming you want the data on all servers to be consistent.

With statement-based replication, triggers executed on the master also execute on the slave. With row-based replication, triggers executed on the master do not execute on the slave. Instead, the row changes on the master resulting from trigger execution are replicated and applied on the slave.

This behavior is by design. [...]

http://dev.mysql.com/doc/refman/5.5/en/replication-features-triggers.html

The documentation often speaks of replication in what sounds like terms of either/or (statement or row) but for each query executed MIXED logging will choose the logging method on a per-query basis. You may also find statements to the effect that row-based logging is "required" in some conditions but this only means that it must be available (i.e., via MIXED); it does not mean that the server must be configured for ROW. DDL, of course, is always logged as statements, regardless of the binlog format.

Best Answer

Related Solutions

Thesql replication deadlock slave stopped

Mysql – Should I disable triggers, timestamps in a slave in thesql

Related Question