Mysql – Getting slaves of a master-master setup stopped in sync

MySQLreplication

I have a master-master setup where each master has it's own slave. They layout might be illustrated as:

 M1 <===> M2
 |         |
 \/        \/
 S1        S2

That is both S1 and M2 slave from M1; both S2 and M1 slave from M2.

I'm wanting to get it to a point where S1 and S2 are stopped and should be in exact sync.

As I try thinking this out I keep running into gotchas that won't quite work out.

Initial niave approach:

Flush tables with read lock on both masters, ensure the slaves are all caught up
to their respective masters position and just stop slave. Even if that's scripted to happen one after another it's just a race condition where say a write got to M2 and subsequently S2, but had been been blocked from getting to S1 b/c of M1's read lock (which made it in before).

My second plan was:

  1. Stop slaving on S1, M2.
  2. Flush tables with read lock on M1.
  3. Start slave until M1's master log pos on S1,M2
  4. Stop slave on S2, M1.
  5. Flush tables with read lock on M2.
  6. Start slave until M2's master log pos on (S2,M1)
  7. Start slave on M1, M2, leaving s1,s2 stopped

The problem though is by step 4, there might have been writes on M2 which had since propagated to S2 but where blocked from getting to S1 b/c of M1's read lock.

Another idea:

  1. Stop slaving on M2 and S1.
  2. Start slaving on M2/S1 until some common position on M1.
  3. Change master on s1 to M2 (s1 and s2 are now slaving directly from M2)
  4. stop slaving s1, s2, start until common position on M2

Problem again being during the shuffling there could have been some M2 writes that made it to s2 but never s1.

Can anyway think of a clean way to tweak these approached to make it work?

One of my requirements is having minimal downtime for the Master-Master cycle while allowing more extended replication downtime for the slave-only instances.

Even if it's overly complicated what ever approach I go would with end up being scripted so if there's some kind of rapid looping/checking until just the right time that shouldn't be a problem.

Edit In Response to Rolondo's solution:

An end goal of what I'm trying to accomplish with this is to stop slaving (at least sql_thread execution) on S1 and S2 to have time to do some lengthy comparisons (checksuming etc).

A concern I have is after step 04 is even though these steps are stopping/starting the master SQL threads, the slaves (S1/S2) replication to their respective masters (M1/M2) is still going. There is a non zero amount of time that occurs between measuring/comparing the M's relationship with each other. Even if they look good by the time I would issue stop slave to S1/S2 something could have come into M1 or M2 and made it into just their slave.

Would you agree with this or am I missing something?

Best Answer

All of these approaches show that you gave these things a lot of thought.

You are worried about any pending changes when running FLUSH TABLES WITH READ LOCK;.

Think about this: When you issue FLUSH TABLES WITH READ LOCK;, how is replication affected? Recall that replication has two threads

  • IO Thread
  • SQL Thread

The IO Thread is responsible for communication between Master and Slave. It downloads binary log entries from the Master and stores them in the Slave's relay logs.

The SQL Thread is responsible for

  • reading the next SQL statement from the Slave's relay logs and processing them
  • maintain are temp tables created within the session of the SQL Thread

When you run FLUSH TABLES WITH READ LOCK;, only the SQL Thread gets affected because it needs to connect to tables. The IO Thread can still collect binary log entries from the Master and store them in the Slave's relay logs. Any replication lag will simply be caught off guard as is. In light of this, STOP SLAVE; should be faster than FLUSH TABLES WITH READ LOCK;. If you are concerned about pending changes, then use STOP SLAVE SQL_THREAD; instead of STOP SLAVE;. That way, whatever is last executed on each Master should be checked.

When you do SHOW SLAVE STATUS\G look for two lines

  • Relay_Master_Log_File (line 10)
  • Exec_Master_Log_Pos (line 22)

This tells you what was the SQL statement downloaded to the Slave that was last executed.

Knowing this, you could try the following

  • Step 01 : On M1 and M2, STOP SLAVE SQL_THREAD;
  • Step 02 : Run SHOW MASTER STATUS; on M1 and M2
  • Step 03 : Run SHOW SLAVE STATUS\G on M1 and M2
  • Step 04 : Evaluate this condition
    • Does M1's File = M2's Relay_Master_Log_File ?
    • Does M2's File = M1's Relay_Master_Log_File ?
    • Does M1's Position = M2's Exec_Master_Log_Pos ?
    • Does M2's Position = M1's Exec_Master_Log_Pos ?
  • Step 05 : If any one of the four conditions in Step 04 is not met
    • On M1 and M2, START SLAVE SQL_THREAD;
    • SELECT SLEEP(30);
    • Go Back to Step 01

If you get past Step 05 with all four conditions in Step 04, M1 and M2 are in sync.

Once M1 and M2 are frozen simultaneously

  • S1 should match M1
    • Wait until S1's Seconds_Behind_Master = 0
    • M1's File = S1's Relay_Master_Log_File
    • M1's Position = S1's Exec_Master_Log_Pos
  • S2 should match M2
    • Wait until S2's Seconds_Behind_Master = 0
    • M2's File = S2's Relay_Master_Log_File
    • M2's Position = S2's Exec_Master_Log_Pos
  • No need to run STOP SLAVE; on S1 or S2

I hope this helps

UPDATE 2012-05-11 17:30 EDT

Once S1 and S2 match up with their respective Master, you could STOP SLAVE; if you want to. Since M1 and M2 are frozen, no other changes can reach S1 or S2. Thus, STOP SLAVE; is not a requirement but you do so anyway.

UPDATE 2012-05-11 21:29 EDT

Your Comment

M1/M2 are frozen from receiving updates from one another but not from receiving a legit update from an external client/application, no?

Are you still accepting incoming feeds? You did say in the original question

As I try thinking this out I keep running into gotchas that won't quite work out.

That would certainly be one gotcha. Therefore, discontinue incoming feeds.

Since you want to do FLUSH TABLES WITH READ LOCK; to M1 and M2, I have one recommendation. Please set this one hour before syncing everything:

SET GLOBAL innodb_max_dirty_pages_pct = 0;

This will clear all dirty pages from the InnoDB Buffer Pool. That way, the time for FLUSH TABLES WITH READ LOCK; is as fast as possible. When all syncing is done, set it back to 90 (if running MySQL 5.5) or 75 (otherwise).

Your Comment

I could see how M1/M2 were locked if they flushed w/ read lock but it seemed your steps were not including such a step

I was not including such a step because I was under the impression you would disable outside feeds.