MySQL replication stops every day for no (obvious) reason

centos, linux, MySQL, replication

I've got a problem with my MySQL master-slave replication.
It works just fine after I start it up, but around 2:30am the next day it stops. Monitoring shows that the slave starts continuously reading until it is manually (forcefully) stopped and restarted. Neither the mysqld.log nor the mysql_general.log shows any errors, and show slave status \G is also clear of any error messages (it just shows the seconds behind master increasing, as expected).

The setup uses row-based replication and runs on MySQL Community Server 5.54.

I've checked all crontabs for any recurring jobs, but they're all clear. Unless there is another way for timed jobs to be triggered, I'm out of ideas here, to be honest.
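
For completeness, MySQL's own Event Scheduler would be another way for timed jobs to fire besides cron. A quick way to rule that out looks roughly like this (sketch):

    -- is the Event Scheduler even enabled?
    SHOW VARIABLES LIKE 'event_scheduler';

    -- list any scheduled events defined on the server
    SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, INTERVAL_VALUE, INTERVAL_FIELD, LAST_EXECUTED
    FROM information_schema.EVENTS;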

Both master and slave are identical in terms of setup (both are CentOS 6 VMs with 4 cores and 7.5GB RAM), and neither master nor slave is experiencing any load peaks around the time the issue appears. The only other thing I did notice was that the disk latency spiked as soon as the slave started reading, but since it seems to be proportional to the reads/s graph, I'll attribute it to that.

Disk performance shouldn't be an issue either, since both are on dedicated storage systems and were (until 2 weeks ago) on the same storage system (an IBM V7000).

Edit:
There are several indicators that replication actually stopped and isn't just lagging behind. First, there's the obvious increase in "Seconds Behind Master", but there's also the lack of further entries in mysql_general.log. More importantly (and perhaps a little more subtly), during the day, after getting replication running again, there is a constant amount of writes visible in our monitoring graphs. At more or less precisely 2:30am this just stops and turns into reads (interestingly enough, the same happens on the master, though I haven't managed to find any good reason for that), and that's also when the last commit event is logged in mysql_general.log.

It also doesn't just start again with START SLAVE; instead the server has to be forcefully stopped, as MySQL still sees the slave process as running. After restarting the server it also doesn't budge unless replication is started with sql_slave_skip_counter=1. I know this is less than ideal, and regardless of what happens I'll have to do a data integrity test at some point to verify that it isn't just messing up somewhere internally (which wouldn't surprise me).
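
For reference, getting it going again boils down to roughly this sequence (sketch; skipping an event can silently drop a transaction, which is exactly why the integrity check is planned):

    -- on the slave, after mysqld has been forcefully restarted
    STOP SLAVE;
    SET GLOBAL sql_slave_skip_counter = 1;  -- skip the event the SQL thread appears stuck on
    START SLAVE;
    -- then check that both threads report Yes again
    SHOW SLAVE STATUS\G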

Slave_IO_State: Queueing master event to the relay log
              Master_Host: <MasterIP>
              Master_User: <Replication User>
              Master_Port: 3306
            Connect_Retry: 60
          Master_Log_File: mysql-bin.000879
      Read_Master_Log_Pos: 628745545
           Relay_Log_File: mysqld-relay-bin.001364
            Relay_Log_Pos: 443942
    Relay_Master_Log_File: mysql-bin.000879
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
          Replicate_Do_DB:
      Replicate_Ignore_DB:
       Replicate_Do_Table:
   Replicate_Ignore_Table:
  Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
               Last_Errno: 0
               Last_Error:
             Skip_Counter: 0
      Exec_Master_Log_Pos: 520752778
          Relay_Log_Space: 108436866
          Until_Condition: None
           Until_Log_File:
            Until_Log_Pos: 0
       Master_SSL_Allowed: No
       Master_SSL_CA_File:
       Master_SSL_CA_Path:
          Master_SSL_Cert:
        Master_SSL_Cipher:
           Master_SSL_Key:
    Seconds_Behind_Master: 21194
Master_SSL_Verify_Server_Cert: No
            Last_IO_Errno: 0
            Last_IO_Error:
           Last_SQL_Errno: 0
           Last_SQL_Error:
Replicate_Ignore_Server_Ids:
         Master_Server_Id: 1

Edit2:
Alright, I checked the binary log with mysqlbinlog to see what the master tried to do before and at the statement where it seems to be stuck. This is the position where it stopped today: 757885512

# at 757885512
#170405  2:00:02 server id 1  end_log_pos 757885592     Query  thread_id=13818268      exec_time=1921  error_code=0
SET TIMESTAMP=1491350402/*!*/;
BEGIN
/*!*/;

And after that there are just a bunch of "# at" lines with increasing position numbers.
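
For reference, the excerpt above comes from roughly this invocation (sketch; the binlog file name is just the one from the slave status further up and may have rotated since, and -v/--base64-output=decode-rows is what turns the row events into the pseudo-SQL shown below):

    # on the master: dump events starting at the position where the slave got stuck
    mysqlbinlog --base64-output=decode-rows -v \
        --start-position=757885512 mysql-bin.000879 | less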

I've checked it again today, and this time I looked at the statement leading up to the one posted above. It just seems to be a regular update to a session table that basically looks like this:

Update `myDB`.`session`
WHERE
@1=<some number>
@2=<some string>
@3=<some other number>
*A few more lines like the one above with some being NULL*
SET
@1=<some number>
@2=<some string>
@3=<some different other number>
*All the numbers again apart from one that's being incremented by 1*

Judging by the data I can see here, it just looks like some counter for a session, with the numbers in the fields mentioned above being UserID, SessionID and so on.

Edit3:

Here's the CREATE TABLE for one of the tables where the replication seems to have gotten stuck:

       Table: session
Create Table: CREATE TABLE `session` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(256) DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `current_page` int(11) DEFAULT NULL,
  `last_reload` int(11) DEFAULT NULL,
  `ip_address` varchar(45) DEFAULT NULL,
  `s_nbHostsUp` int(11) DEFAULT NULL,
  `s_nbHostsDown` int(11) DEFAULT NULL,
  `s_nbHostsUnreachable` int(11) DEFAULT NULL,
  `s_nbHostsPending` int(11) DEFAULT NULL,
  `s_nbServicesOk` int(11) DEFAULT NULL,
  `s_nbServicesWarning` int(11) DEFAULT NULL,
  `s_nbServicesCritical` int(11) DEFAULT NULL,
  `s_nbServicesPending` int(11) DEFAULT NULL,
  `s_nbServicesUnknown` int(11) DEFAULT NULL,
  `update_acl` enum('0','1') DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `session_id` (`session_id`(255)),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=13493 DEFAULT CHARSET=utf8

Also, I have to add – I wrote below that the engine is MyISAM – it is actually a mix of InnoDB and MyISAM, with the MyISAM tables making up the part of the DB that gets changed the most, including the biggest table.
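
To see which tables use which engine and how big they are, something like this does the job (sketch; the schema name myDB is a placeholder):

    -- engine and approximate size per table, largest first
    SELECT TABLE_NAME, ENGINE,
           ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 1) AS size_gb
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'myDB'
    ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;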

Edit4:

Today's entry where the system is stuck looks pretty much the same; it is just after an update to the session table. However (maybe by pure accident), I stumbled upon something:

#170408  2:35:05 server id 1  end_log_pos 815569300     Query   thread_id=38771 exec_time=10    error_code=0
SET TIMESTAMP=1491611705/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
BEGIN
/*!*/;
# at <some number>
*some more # at <some number> entries later*
#170408  2:35:05 server id 1  end_log_pos 815569365     Table_map: `application_storage`.`data_bin` mapped to number 296
#170408  2:35:05 server id 1  end_log_pos 815570402     Write_rows: table id 296
#170408  2:35:05 server id 1  end_log_pos 815571439     Write_rows: table id 296
#170408  2:35:05 server id 1  end_log_pos 815572476     Write_rows: table id 296

Now, data_bin is a huge MyISAM table, around 20-23GB in size and making up about 4/5 of the DB's overall size (huge compared to all the other tables).

Edit5:

Something else I noticed today when I logged into the DB to check the replication status: it seems like MySQL was stopped at some point (or at least my session somehow timed out). I had left an SSH session open where I was already logged in, and when I executed show slave status \G it told me that the server had "gone away", i.e. it was stopped/crashed, but apparently also restarted, as the client managed to reconnect. The MySQL error log doesn't show any crashes or restarts though, which is interesting.
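
A quick way to cross-check whether mysqld really restarted, independent of the error log (sketch): error 2006 ("server has gone away") can also just mean an idle connection was dropped by wait_timeout, so the server uptime is the more reliable indicator.

    -- seconds since mysqld was last started; if this is smaller than the time since
    -- the session saw "gone away", the server really did restart
    SHOW GLOBAL STATUS LIKE 'Uptime';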

Best Answer

Check what is running at that time, or at that binlog position. You can check via the binlog events and trace what statement is running at that time (a sketch follows the list below). One reason could be that a huge operation is running that has to deal with lots of records, or a query/statement that isn't necessarily huge but isn't using an index properly. This is one of the situations where you may experience the following (same as in your case, if I understood it correctly):

  1. Exec_Master_Log_Pos will not move any further
  2. Seconds_Behind_Master will keep increasing (because of #1)
  3. you will not be able to stop the slave
  4. STOP SLAVE will hang and never complete its execution
  5. you might not be able to stop mysqld gracefully using the normal stop command
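
Something like this will list the events at the position the slave is stuck on (sketch; file name and position are taken from your SHOW SLAVE STATUS output above):

    -- on the master: list events starting at the slave's Exec_Master_Log_Pos
    SHOW BINLOG EVENTS IN 'mysql-bin.000879' FROM 520752778 LIMIT 20;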

I suspect this is related to an untuned query which is killing your slave. Try to check what the query at the Exec_Master_Log_Pos position is (a sketch follows the checklist):

  1. Does it have a proper primary key?
  2. Is the query using a proper index?
  3. Which engine is it using?
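
A rough way to answer these three questions, using the session table from your question as an example (sketch; the id value is a placeholder, and on 5.5 EXPLAIN only works for SELECT, so rewrite the UPDATE's WHERE clause as a SELECT):

    -- primary key, indexes and storage engine of the table in question
    SHOW CREATE TABLE myDB.session\G

    -- check whether the statement can use an index; 12345 is a placeholder id
    EXPLAIN SELECT * FROM myDB.session WHERE id = 12345\G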

Hope this helps.