MySQL replication stops every day for no (obvious) reason

centos, linux, MySQL, replication

I've got a problem with my MySQL master-slave replication.
It works just fine after I start it up, but around 2:30am the next day it stops. Monitoring shows that the slave starts continuously reading until it is manually (forcefully) stopped and restarted. Neither the mysqld.log nor the mysql_general.log shows any errors, and show slave status \G is also clear of any error messages (it just shows the seconds behind master increasing, as expected).

The setup uses row-based replication and runs on MySQL Community Server 5.54.

I've checked all crontabs for any recurring jobs, but they're all clear. Unless there is another way for timed jobs to be triggered, I'm out of ideas here, to be honest.
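
For completeness, MySQL's own Event Scheduler would be another way for timed jobs to fire besides cron. A quick way to rule that out looks roughly like this (sketch):

    -- is the Event Scheduler even enabled?
    SHOW VARIABLES LIKE 'event_scheduler';

    -- list any scheduled events defined on the server
    SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, INTERVAL_VALUE, INTERVAL_FIELD, LAST_EXECUTED
    FROM information_schema.EVENTS;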

Both master and slave are identical in terms of setup (both are CentOS 6 VMs with 4 cores and 7.5GB RAM), and neither master nor slave is experiencing any load peaks around the time the issue appears. The only other thing I did notice was that the disk latency spiked as soon as the slave started reading, but since it seems to be proportional to the reads/s graph, I'll attribute it to that.

Disk performance shouldn't be an issue either, since both are on dedicated storage systems and were (until 2 weeks ago) on the same storage system (an IBM V7000).

Edit:
There are several indicators that replication actually stopped and isn't just lagging behind. First, there's the obvious increase in "Seconds Behind Master", but there's also the lack of further entries in mysql_general.log. More importantly (and perhaps a little more subtly), during the day, after getting replication running again, there is a constant amount of writes visible in our monitoring graphs. At more or less precisely 2:30am this just stops and turns into reads (interestingly enough, the same happens on the master, though I haven't managed to find any good reason for that), and that's also when the last commit event is logged in mysql_general.log.

It also doesn't just start again with START SLAVE; instead the server has to be forcefully stopped, as MySQL still sees the slave process as running. After restarting the server it also doesn't budge unless replication is started with sql_slave_skip_counter=1. I know this is less than ideal, and regardless of what happens I'll have to do a data integrity test at some point to verify that it isn't just messing up somewhere internally (which wouldn't surprise me).
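
For reference, getting it going again boils down to roughly this sequence (sketch; skipping an event can silently drop a transaction, which is exactly why the integrity check is planned):

    -- on the slave, after mysqld has been forcefully restarted
    STOP SLAVE;
    SET GLOBAL sql_slave_skip_counter = 1;  -- skip the event the SQL thread appears stuck on
    START SLAVE;
    -- then check that both threads report Yes again
    SHOW SLAVE STATUS\G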

Slave_IO_State: Queueing master event to the relay log
              Master_Host: <MasterIP>
              Master_User: <Replication User>
              Master_Port: 3306
            Connect_Retry: 60
          Master_Log_File: mysql-bin.000879
      Read_Master_Log_Pos: 628745545
           Relay_Log_File: mysqld-relay-bin.001364
            Relay_Log_Pos: 443942
    Relay_Master_Log_File: mysql-bin.000879
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
          Replicate_Do_DB:
      Replicate_Ignore_DB:
       Replicate_Do_Table:
   Replicate_Ignore_Table:
  Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
               Last_Errno: 0
               Last_Error:
             Skip_Counter: 0
      Exec_Master_Log_Pos: 520752778
          Relay_Log_Space: 108436866
          Until_Condition: None
           Until_Log_File:
            Until_Log_Pos: 0
       Master_SSL_Allowed: No
       Master_SSL_CA_File:
       Master_SSL_CA_Path:
          Master_SSL_Cert:
        Master_SSL_Cipher:
           Master_SSL_Key:
    Seconds_Behind_Master: 21194
Master_SSL_Verify_Server_Cert: No
            Last_IO_Errno: 0
            Last_IO_Error:
           Last_SQL_Errno: 0
           Last_SQL_Error:
Replicate_Ignore_Server_Ids:
         Master_Server_Id: 1

Edit2:
Alright, I checked the binary log with mysqlbinlog to see what the master tried to do before and at the statement where it seems to be stuck. This is the position where it stopped today: 757885512

# at 757885512
#170405  2:00:02 server id 1  end_log_pos 757885592     Query  thread_id=13818268      exec_time=1921  error_code=0
SET TIMESTAMP=1491350402/*!*/;
BEGIN
/*!*/;

And after that there are just a bunch of "# at" lines with increasing position numbers.
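
For reference, the excerpt above comes from roughly this invocation (sketch; the binlog file name is just the one from the slave status further up and may have rotated since, and -v/--base64-output=decode-rows is what turns the row events into the pseudo-SQL shown below):

    # on the master: dump events starting at the position where the slave got stuck
    mysqlbinlog --base64-output=decode-rows -v \
        --start-position=757885512 mysql-bin.000879 | less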

I've checked it again today, and this time I looked at the statement leading up to the one posted above. It just seems to be a regular update to a session table that basically looks like this:

Update `myDB`.`session`
WHERE
@1=<some number>
@2=<some string>
@3=<some other number>
*A few more lines like the one above with some being NULL*
SET
@1=<some number>
@2=<some string>
@3=<some different other number>
*All the numbers again apart from one that's being incremented by 1*

Judging by the data I can see here, it just looks like some counter for a session, with the numbers in the fields mentioned above being UserID, SessionID and so on.

Edit3:

Here's the CREATE TABLE for one of the tables where the replication seems to have gotten stuck:

       Table: session
Create Table: CREATE TABLE `session` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(256) DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `current_page` int(11) DEFAULT NULL,
  `last_reload` int(11) DEFAULT NULL,
  `ip_address` varchar(45) DEFAULT NULL,
  `s_nbHostsUp` int(11) DEFAULT NULL,
  `s_nbHostsDown` int(11) DEFAULT NULL,
  `s_nbHostsUnreachable` int(11) DEFAULT NULL,
  `s_nbHostsPending` int(11) DEFAULT NULL,
  `s_nbServicesOk` int(11) DEFAULT NULL,
  `s_nbServicesWarning` int(11) DEFAULT NULL,
  `s_nbServicesCritical` int(11) DEFAULT NULL,
  `s_nbServicesPending` int(11) DEFAULT NULL,
  `s_nbServicesUnknown` int(11) DEFAULT NULL,
  `update_acl` enum('0','1') DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `session_id` (`session_id`(255)),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=13493 DEFAULT CHARSET=utf8

Also, I have to add – I wrote below that the engine is MyISAM – it is actually a mix of InnoDB and MyISAM, with the MyISAM tables making up the part of the DB that gets changed the most, including the biggest table.
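
To see which tables use which engine and how big they are, something like this does the job (sketch; the schema name myDB is a placeholder):

    -- engine and approximate size per table, largest first
    SELECT TABLE_NAME, ENGINE,
           ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 1) AS size_gb
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'myDB'
    ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;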

Edit4:

Today's entry where the system is stuck looks pretty much the same; it is just after an update to the session table. However (maybe by pure accident), I stumbled upon something:

#170408  2:35:05 server id 1  end_log_pos 815569300     Query   thread_id=38771 exec_time=10    error_code=0
SET TIMESTAMP=1491611705/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
BEGIN
/*!*/;
# at <some number>
*some more # at <some number> entries later*
#170408  2:35:05 server id 1  end_log_pos 815569365     Table_map: `application_storage`.`data_bin` mapped to number 296
#170408  2:35:05 server id 1  end_log_pos 815570402     Write_rows: table id 296
#170408  2:35:05 server id 1  end_log_pos 815571439     Write_rows: table id 296
#170408  2:35:05 server id 1  end_log_pos 815572476     Write_rows: table id 296

Now, data_bin is a huge MyISAM table, around 20-23GB in size and making up about 4/5 of the DB's overall size (huge compared to all the other tables).

Edit5:

Something else I noticed today when I logged into the DB to check the replication status: it seems like MySQL was stopped at some point (or at least my session somehow timed out). I had left an SSH session open where I was already logged in, and when I executed show slave status \G it told me that the server had "gone away", i.e. it was stopped/crashed, but apparently also restarted, as the client managed to reconnect. The MySQL error log doesn't show any crashes or restarts though, which is interesting.
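
A quick way to cross-check whether mysqld really restarted, independent of the error log (sketch): error 2006 ("server has gone away") can also just mean an idle connection was dropped by wait_timeout, so the server uptime is the more reliable indicator.

    -- seconds since mysqld was last started; if this is smaller than the time since
    -- the session saw "gone away", the server really did restart
    SHOW GLOBAL STATUS LIKE 'Uptime';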

Best Answer

Check what is running at that time, or at that binlog position. You can check via the binlog events and trace what statement is running at that time (a sketch follows the list below). One reason could be that a huge operation is running that has to deal with lots of records, or a query/statement that isn't necessarily huge but isn't using an index properly. This is one of the situations where you may experience the following (same as in your case, if I understood it correctly):

  1. Exec_Master_Log_Pos will not move any further
  2. Seconds_Behind_Master will keep increasing (because of #1)
  3. you will not be able to stop the slave
  4. STOP SLAVE will hang and never complete its execution
  5. you might not be able to stop mysqld gracefully using the normal stop command
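
Something like this will list the events at the position the slave is stuck on (sketch; file name and position are taken from your SHOW SLAVE STATUS output above):

    -- on the master: list events starting at the slave's Exec_Master_Log_Pos
    SHOW BINLOG EVENTS IN 'mysql-bin.000879' FROM 520752778 LIMIT 20;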

I suspect this is related to an untuned query which is killing your slave. Try to check what the query at the Exec_Master_Log_Pos position is (a sketch follows the checklist):

  1. Does it have a proper primary key?
  2. Is the query using a proper index?
  3. Which engine is it using?
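
A rough way to answer these three questions, using the session table from your question as an example (sketch; the id value is a placeholder, and on 5.5 EXPLAIN only works for SELECT, so rewrite the UPDATE's WHERE clause as a SELECT):

    -- primary key, indexes and storage engine of the table in question
    SHOW CREATE TABLE myDB.session\G

    -- check whether the statement can use an index; 12345 is a placeholder id
    EXPLAIN SELECT * FROM myDB.session WHERE id = 12345\G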

Hope this helps.