Mysql – Screwed up replication by sharing server ids

MySQL, replication

So I had my screwup for the week today. I was adding another pair of slaves and set the same server ID on the new slave as an old slave.

The layout kind of looks like

Master
|     |
\/    |
oldS  |
      \/
      newS1
      |
      |
      \/
      newS2

So to be clear the old slave (oldS) and the first new slave (newS1) share the same server ID.

It's not circular replication so I'm hoping things will turn out okay. I wouldn't have expected the fallout though.

Alarms started going off b/c oldS started falling farther and farther behind. Looking at the log directory, it was creating thousands and thousands of empty relay logs.

I stopped slaving on newS1 and that seemed to clear things up: oldS stopped making empty relay logs and caught back up.

Both slaves seem to be in a consistent state up to the point I stopped slaving on newS1.

Will fixing everything be as simple as bouncing newS1 with a new, unique ID, especially considering newS1 is itself the master of newS2?
Is there anything else to be cautious about?
Why did this result in oldS spawning empty relay log after empty relay log? I would have thought oldS and newS1 had no knowledge of each other's existence.
I thought relay log rolling was determined by the slave itself. Is the master sending some signal that it should spawn a new relay log?

Best Answer

From my standpoint, you may have potentially introduced data drift into replication.

Baron Schwartz presented this as a puzzle in his blog.

You may have to reload oldS and newS1 with fresh data.

At the very least, you should use pt-table-checksum to see if

the data on Master differs from oldS
the data on Master differs from newS1

If either check finds differences, you can either

reload oldS and newS1 fresh
run pt-table-sync

Before you touch anything else, please fix the server-id situation.

The spawning of relay logs is to be expected because sibling slaves take turns getting SQL entries from the Master. They simply cannot share the same server-id. The Master will somehow signal to subsequent slaves that it has already given an SQL statement to that server-id. Thus, the I/O thread on those slaves will disconnect and retry, and each reconnect starts a new relay log. Consequently, the empty relay logs pile up. (Trust me, I shot myself in the foot with this years ago.)
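As a rough illustration of the check worth doing before adding a slave, here is a minimal Python sketch that flags hosts sharing a server-id. The host names come from the question's diagram; the ID values and the function name are made up for the example, and in practice the IDs would be collected from each server (for instance with SELECT @@server_id;).

```python
from collections import defaultdict


def find_duplicate_server_ids(server_ids):
    """Return {server_id: [hosts]} for every ID used by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, server_id in server_ids.items():
        hosts_by_id[server_id].append(host)
    # Keep only the IDs that appear on two or more hosts.
    return {
        server_id: sorted(hosts)
        for server_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Hypothetical ID values: oldS and newS1 collide, as in the question.
topology = {"Master": 1, "oldS": 2, "newS1": 2, "newS2": 4}
duplicates = find_duplicate_server_ids(topology)
for server_id, hosts in duplicates.items():
    print(f"server-id {server_id} is shared by: {', '.join(hosts)}")
```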

This methodology of the Master talking to Slaves for this info allowed MySQL (eh, Oracle) to come up with semisync replication. Duplicate server-ids would break semisync replication as well. Even though MySQL 5.6 will soon introduce a Global Transaction ID into the mix, server-id will still be used in its method of checks-and-balances on the Master. After all, if an eagle had two eaglets, no eagle would spit into two mouths at the same time in order to feed them.

Related Solutions

Mysql – Adding a new slave slowing down previous replicas

Your new slave server can possibly be slow in some respect and not able to process the queries in the binary log as quickly as the other servers.

You should also look at the possibility that you may have run a very heavy query that caused the new slave to lag as well as the other slaves.

Regarding the new slave, you may have a performance problem that needs to be resolved. Possibilities could include:

Slow hard disks (you mentioned you are buying new ones)
Bad my.cnf configuration. Flushes too often to disk? flush method isn't O_DIRECT?
Long running queries from master server - consider using row-based replication if you feel it helps your overall system performance.
Caches on the new slave aren't warm for running the queries from the master - do you use INSERT INTO ... SELECT .. FROM statements? or non-deterministic insert statements like ones with sub-queries?
Raid card with battery - do the other servers have it and the slave doesn't?

Mysql – Getting slaves of a master-master setup stopped in sync

All of these approaches show that you gave these things a lot of thought.

You are worried about any pending changes when running FLUSH TABLES WITH READ LOCK;.

Think about this: When you issue FLUSH TABLES WITH READ LOCK;, how is replication affected? Recall that replication has two threads

IO Thread
SQL Thread

The IO Thread is responsible for communication between Master and Slave. It downloads binary log entries from the Master and stores them in the Slave's relay logs.

The SQL Thread is responsible for

reading the next SQL statement from the Slave's relay logs and processing it
maintaining any temp tables created within the session of the SQL Thread

When you run FLUSH TABLES WITH READ LOCK;, only the SQL Thread gets affected because it needs to access tables. The IO Thread can still collect binary log entries from the Master and store them in the Slave's relay logs. Any existing replication lag will simply be frozen as is. In light of this, STOP SLAVE; should be faster than FLUSH TABLES WITH READ LOCK;. If you are concerned about pending changes, then use STOP SLAVE SQL_THREAD; instead of STOP SLAVE;. That way, you can check what was last executed from each Master.

When you do SHOW SLAVE STATUS\G look for two lines

Relay_Master_Log_File (line 10)
Exec_Master_Log_Pos (line 22)

Together, these tell you the Master's binary log file and position of the last SQL statement that was downloaded to the Slave and executed.
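If you capture the output of SHOW SLAVE STATUS\G as text, a small Python sketch like the one below can pull out those two fields. The function name and the sample values (log file name, position) are made up for illustration; only the field names come from the real output.

```python
def parse_vertical_status(text):
    """Parse 'Key: value' lines from \\G-style output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:  # row-separator lines have no colon and are skipped
            status[key.strip()] = value.strip()
    return status


# Hypothetical excerpt of SHOW SLAVE STATUS\G output.
sample_output = """
*************************** 1. row ***************************
        Relay_Master_Log_File: mysql-bin.000012
          Exec_Master_Log_Pos: 107
        Seconds_Behind_Master: 0
"""

status = parse_vertical_status(sample_output)
print(status["Relay_Master_Log_File"], status["Exec_Master_Log_Pos"])
```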

Knowing this, you could try the following

Step 01 : On M1 and M2, STOP SLAVE SQL_THREAD;
Step 02 : Run SHOW MASTER STATUS; on M1 and M2
Step 03 : Run SHOW SLAVE STATUS\G on M1 and M2
Step 04 : Evaluate this condition
- Does M1's File = M2's Relay_Master_Log_File ?
- Does M2's File = M1's Relay_Master_Log_File ?
- Does M1's Position = M2's Exec_Master_Log_Pos ?
- Does M2's Position = M1's Exec_Master_Log_Pos ?
Step 05 : If any one of the four conditions in Step 04 is not met
- On M1 and M2, START SLAVE SQL_THREAD;
- SELECT SLEEP(30);
- Go Back to Step 01

If you get past Step 05 with all four conditions in Step 04 met, M1 and M2 are in sync.
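The comparison in Step 04 can be sketched in Python as follows. This is only an illustration: it assumes the output of SHOW MASTER STATUS and SHOW SLAVE STATUS has already been fetched into dictionaries keyed by column name, and the function names and sample values are hypothetical.

```python
def executed_up_to(master_status, slave_status):
    """True if the slave has executed everything the master has logged."""
    return (
        master_status["File"] == slave_status["Relay_Master_Log_File"]
        and int(master_status["Position"])
        == int(slave_status["Exec_Master_Log_Pos"])
    )


def masters_in_sync(m1_master, m1_slave, m2_master, m2_slave):
    """Evaluate all four Step 04 conditions for a master-master pair."""
    # M1's coordinates are compared with M2's slave status, and vice versa.
    return executed_up_to(m1_master, m2_slave) and executed_up_to(
        m2_master, m1_slave
    )


# Hypothetical sample values.
m1_master = {"File": "mysql-bin.000012", "Position": 107}
m2_master = {"File": "mysql-bin.000034", "Position": 5520}
m1_slave = {"Relay_Master_Log_File": "mysql-bin.000034", "Exec_Master_Log_Pos": 5520}
m2_slave = {"Relay_Master_Log_File": "mysql-bin.000012", "Exec_Master_Log_Pos": 107}

in_sync = masters_in_sync(m1_master, m1_slave, m2_master, m2_slave)
print("in sync" if in_sync else "not in sync: restart the SQL threads, wait, retry")
```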

Once M1 and M2 are frozen simultaneously

S1 should match M1
- Wait until S1's Seconds_Behind_Master = 0
- M1's File = S1's Relay_Master_Log_File
- M1's Position = S1's Exec_Master_Log_Pos
S2 should match M2
- Wait until S2's Seconds_Behind_Master = 0
- M2's File = S2's Relay_Master_Log_File
- M2's Position = S2's Exec_Master_Log_Pos
No need to run STOP SLAVE; on S1 or S2
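The same kind of comparison applies to each read slave. Here is a minimal sketch, again assuming the status rows are already available as dictionaries; the function name and sample values are made up. It treats a NULL Seconds_Behind_Master (None here) as "not caught up", since that usually means replication is not running.

```python
def slave_caught_up(master_status, slave_status):
    """True if the slave reports zero lag and matches the master's coordinates."""
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:  # NULL: replication threads are not running
        return False
    return (
        int(lag) == 0
        and master_status["File"] == slave_status["Relay_Master_Log_File"]
        and int(master_status["Position"])
        == int(slave_status["Exec_Master_Log_Pos"])
    )


# Hypothetical sample values for M1 and S1.
m1_status = {"File": "mysql-bin.000012", "Position": 107}
s1_status = {
    "Seconds_Behind_Master": 0,
    "Relay_Master_Log_File": "mysql-bin.000012",
    "Exec_Master_Log_Pos": 107,
}

s1_ready = slave_caught_up(m1_status, s1_status)
print("S1 matches M1" if s1_ready else "S1 is still behind M1")
```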

I hope this helps

UPDATE 2012-05-11 17:30 EDT

Once S1 and S2 match up with their respective Master, you could STOP SLAVE; if you want to. Since M1 and M2 are frozen, no other changes can reach S1 or S2. Thus, STOP SLAVE; is not a requirement, but you can do so anyway.

UPDATE 2012-05-11 21:29 EDT

Your Comment

M1/M2 are frozen from receiving updates from one another but not from receiving a legit update from an external client/application, no?

Are you still accepting incoming feeds? You did say in the original question

As I try thinking this out I keep running into gotchas that won't quite work out.

That would certainly be one gotcha. Therefore, discontinue incoming feeds.

Since you want to do FLUSH TABLES WITH READ LOCK; to M1 and M2, I have one recommendation. Please set this one hour before syncing everything:

SET GLOBAL innodb_max_dirty_pages_pct = 0;

This will make InnoDB flush the dirty pages out of the InnoDB Buffer Pool. That way, the time for FLUSH TABLES WITH READ LOCK; is as short as possible. When all syncing is done, set it back to its previous value (the default is 75 in MySQL 5.5 and 90 with the older built-in InnoDB).

Your Comment

I could see how M1/M2 were locked if they flushed w/ read lock but it seemed your steps were not including such a step

I was not including such a step because I was under the impression you would disable outside feeds.
