I have a master-master setup where each master has it's own slave. They layout might be illustrated as:
M1 <===> M2
| |
\/ \/
S1 S2
That is both S1 and M2 slave from M1; both S2 and M1 slave from M2.
I'm wanting to get it to a point where S1 and S2 are stopped and should be in exact sync.
As I try thinking this out I keep running into gotchas that won't quite work out.
Initial niave approach:
Flush tables with read lock on both masters, ensure the slaves are all caught up
to their respective masters position and just stop slave. Even if that's scripted to happen one after another it's just a race condition where say a write got to M2 and subsequently S2, but had been been blocked from getting to S1 b/c of M1's read lock (which made it in before).
My second plan was:
- Stop slaving on S1, M2.
- Flush tables with read lock on M1.
- Start slave until M1's master log pos on S1,M2
- Stop slave on S2, M1.
- Flush tables with read lock on M2.
- Start slave until M2's master log pos on (S2,M1)
- Start slave on M1, M2, leaving s1,s2 stopped
The problem though is by step 4, there might have been writes on M2 which had since propagated to S2 but where blocked from getting to S1 b/c of M1's read lock.
Another idea:
- Stop slaving on M2 and S1.
- Start slaving on M2/S1 until some common position on M1.
- Change master on s1 to M2 (s1 and s2 are now slaving directly from M2)
- stop slaving s1, s2, start until common position on M2
Problem again being during the shuffling there could have been some M2 writes that made it to s2 but never s1.
Can anyway think of a clean way to tweak these approached to make it work?
One of my requirements is having minimal downtime for the Master-Master cycle while allowing more extended replication downtime for the slave-only instances.
Even if it's overly complicated what ever approach I go would with end up being scripted so if there's some kind of rapid looping/checking until just the right time that shouldn't be a problem.
Edit In Response to Rolondo's solution:
An end goal of what I'm trying to accomplish with this is to stop slaving (at least sql_thread execution) on S1 and S2 to have time to do some lengthy comparisons (checksuming etc).
A concern I have is after step 04 is even though these steps are stopping/starting the master SQL threads, the slaves (S1/S2) replication to their respective masters (M1/M2) is still going. There is a non zero amount of time that occurs between measuring/comparing the M's relationship with each other. Even if they look good by the time I would issue stop slave to S1/S2 something could have come into M1 or M2 and made it into just their slave.
Would you agree with this or am I missing something?
Best Answer
All of these approaches show that you gave these things a lot of thought.
You are worried about any pending changes when running
FLUSH TABLES WITH READ LOCK;
.Think about this: When you issue
FLUSH TABLES WITH READ LOCK;
, how is replication affected? Recall that replication has two threadsThe IO Thread is responsible for communication between Master and Slave. It downloads binary log entries from the Master and stores them in the Slave's relay logs.
The SQL Thread is responsible for
When you run
FLUSH TABLES WITH READ LOCK;
, only the SQL Thread gets affected because it needs to connect to tables. The IO Thread can still collect binary log entries from the Master and store them in the Slave's relay logs. Any replication lag will simply be caught off guard as is. In light of this,STOP SLAVE;
should be faster thanFLUSH TABLES WITH READ LOCK;
. If you are concerned about pending changes, then useSTOP SLAVE SQL_THREAD;
instead ofSTOP SLAVE;
. That way, whatever is last executed on each Master should be checked.When you do
SHOW SLAVE STATUS\G
look for two linesThis tells you what was the SQL statement downloaded to the Slave that was last executed.
Knowing this, you could try the following
STOP SLAVE SQL_THREAD;
SHOW MASTER STATUS;
on M1 and M2SHOW SLAVE STATUS\G
on M1 and M2START SLAVE SQL_THREAD;
If you get past Step 05 with all four conditions in Step 04, M1 and M2 are in sync.
Once M1 and M2 are frozen simultaneously
STOP SLAVE;
on S1 or S2I hope this helps
UPDATE 2012-05-11 17:30 EDT
Once S1 and S2 match up with their respective Master, you could
STOP SLAVE;
if you want to. Since M1 and M2 are frozen, no other changes can reach S1 or S2. Thus,STOP SLAVE;
is not a requirement but you do so anyway.UPDATE 2012-05-11 21:29 EDT
Your Comment
Are you still accepting incoming feeds? You did say in the original question
That would certainly be one gotcha. Therefore, discontinue incoming feeds.
Since you want to do
FLUSH TABLES WITH READ LOCK;
to M1 and M2, I have one recommendation. Please set this one hour before syncing everything:This will clear all dirty pages from the InnoDB Buffer Pool. That way, the time for
FLUSH TABLES WITH READ LOCK;
is as fast as possible. When all syncing is done, set it back to 90 (if running MySQL 5.5) or 75 (otherwise).Your Comment
I was not including such a step because I was under the impression you would disable outside feeds.