Mysql – Relay log read fail, relay or master log currupt ,How to repair

group-replicationmysql-8.0mysql-innodb-clusterreplication

I am running mysql InnoDB cluster 8.0.17(group replication)
During the repication the slave disk gets full and no space left at all the thread waits for the disk space to be freed at the same time this was happening the server restarted once server came back mysql error log shows relay log is corrupt and slave cannot rejoin.
my question is How can I repair the slave?

error log
Disk is full writing ‘./ic-1-relay-bin-group_replication_applier.000039’ (OS errno 28 – No space left on device). Waiting for someone to free space… Retry in 60 secs. Message reprinted in 600 secs.
2019-08-21T19:02:37.095609Z 33 [ERROR] [MY-010584] [Repl] Slave SQL for channel ‘group_replication_applier’: Could not execute Write_rows event on table sbtest.sbtest3; Error writing file ‘/tmp/MLfd=28’ (OS errno 22019-08-22T06:55:36.766622Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.17) starting as process 910
2019-08-22T06:55:43.510632Z 1 [Warning] [MY-012637] [InnoDB] 81920 bytes should have been written. Only 77824 bytes written. Retrying for the remaining bytes.
2019-08-22T06:55:43.510669Z 1 [Warning] [MY-012638] [InnoDB] Retry attempts for writing partial data failed.
2019-08-22T06:55:43.510683Z 1 [ERROR] [MY-012639] [InnoDB] Write to file ./#innodb_temp/temp_6.ibt failed at offset 0, 81920 bytes should have been written, only 77824 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
….
Error number 28 means ‘No space left on device’
2019-08-22T07:02:37.278732Z 1 [ERROR] [MY-012267] [InnoDB] Could not set the file size of ‘./ibtmp1’. Probably out of disk space
2019-08-22T07:02:37.278746Z 1 [ERROR] [MY-012926] [InnoDB] Unable to create the shared innodb_temporary.
2019-08-22T07:02:37.278764Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2019-08-22T07:02:37.776328Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2019-08-22T07:02:37.776504Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2019-08-22T07:02:37.779913Z 0 [ERROR] [MY-010119] [Server] Aborting
…
[MY-010818] [Server] Error reading GTIDs from relaylog: -1
Slave SQL for channel ‘group_replication_applier’: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave. Error_code: MY-013121

Best Answer

After searching for sometime I found a reason and solution.

By default the relay_log_purge variable was on and the relay_log_recovery was off in mysql configuration and both values are default. As per the mysql documentation the --relay-log-recovery option must be enabled on the slave to guarantee resilience in the event of an unexpected server halt. so in my case due to no disk space server restarted and relay log corrupt error message came now

SOLUTION:

first remove the node from group

do this on the removed slave node

STOP GROUP REPLICATION

RESET SLAVE

rejoin node to the cluster .It will work.

Explanation: RESET SLAVE makes the slave forget its replication position in the master's binary log deletes all the relay log files, and starts a new relay log file.

To use RESET SLAVE on a Group Replication group member, the member status must be OFFLINE, meaning that the plugin is loaded but the member does not currently belong to any group. A group member can be taken offline by using a STOP GROUP REPLICATION statement

For a server where GTIDs are in use (gtid_mode is ON), issuing RESET SLAVE has no effect on the GTID execution history. The statement does not change the values of gtid_executed or gtid_purged

Related Solutions

Mysql – Looking for an efficient way to fix “Could not parse relay log event entry…” error

There is more stable approach you can try

Here is something to remember

Whenever you run CHANGE MASTER TO, it will erase every relay log you have. You do not want to keep relay logs of commands you have not executed any SQL on as of yet

The following is an excerpt taken from a post I made back on Feb 03, 2012 : How to resolve the master server shut down/unavailability in mysql with master - slave replication :

Please notice that there are two sets of replication coordinates from the Master

(Master_Log_File,Read_Master_Log_Pos)

(Relay_Master_Log_File,Exec_Master_Log_Pos)

There is a major difference between them

(Master_Log_File,Read_Master_Log_Pos) tells you the last binlog statement from the Master's log file and log position that the Slave read from the Master and placed in its Relay Logs.

(Relay_Master_Log_File,Exec_Master_Log_Pos) tells you the last binlog statement from the Master's log file and log position that the Slave read from the Master and placed in its Relay Logs THAT IS NEXT TO BE EXECUTED ON THE SLAVE.

What you want are two things:

Erase Every Binary Log You Have
Start Collecting Binary Log Entries From the Last SQL You Successfully Executed.

In your case, you must use the second set of Replication Coordinates

Relay_Master_Log_File
Exec_Master_Log_Pos

It is easy to distrust a corrupt relay log as shown in the error message. The one that hurts the most is a corrupt Master Log. You will have to jump through hoops if that is the case. On the other hand, if one of the other situations was the reason for the corrupt relay log, the simplest and most concise approach is what I stated.

To make sure, whatever is reported for Relay_Master_Log_File, if that particular binary log still exists on the Master, perform a mysqlbinlog on it. If it dumps in its entirety without corrupt characters, go ahead and use the second set of replication coordinates.

From my same earlier post

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.48.20.253
                Master_User: replicant
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000254
        Read_Master_Log_Pos: 858190247
             Relay_Log_File: relay-bin.066069
              Relay_Log_Pos: 873918
      Relay_Master_Log_File: mysql-bin.000254
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 858190247
            Relay_Log_Space: 873772
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

notice that the Replication Coordinates from SHOW SLAVE STATUS\G for what was last executed are (mysql-bin.000254,858190247). The CHANGE MASTER TO command in this case would be:

CHANGE MASTER TO master_log_file='mysql-bin.000254',master_log_pos=858190247;

Give it a Try !!!

UPDATE 2012-09-14 16:38 EDT

If you worried about the stockpiling relay logs, just throttle the relay logs. In SHOW SLAVE STATUS\G, there is a field called Relay_Log_Space. That gives you the sum of all relay sizes in bytes. Did you know you could put a cap on that number ?

The option is called relay_log_space_limit.

For example, if you want to cap the total number of bytes to 10G, do the following

STEP 01) Add this to /etc/my.cnf on the Slave

[mysqld]
relay_log_space_limit = 10G

STEP 02) Run service mysql restart on the Slave

and that's it !!!

When the oldest relay has all its entries processed, it is deleted and a new relay log is created. That gets filled until all relay logs add up to 10G. That's the only way to control runaway relay log space issues.

UPDATE 2012-09-14 18:10 EDT

SUGGESTION : If you make mysqldump backups of the data on the Slave every midnight, you could set up the following to restrict having 1TB of binary logs:

STEP 01) Add this to /etc/my.cnf on the Master

[mysqld]
expire_logs_days = 14

STEP 02) Run this query on the Master

mysql> PURGE BINARY LOGS BEFORE DATE(NOW()) - INTERVAL 14 DAY;

STEP 03) service mysql restart on the Master

STEP 04) Add a mysqldump backup script to a crontab on the Slave

This will make the Slave more useful and would control having excess binary logs to worry about

MySQL slave stuck in “Reading event from the relay log”

If you are running LOAD DATA LOCAL INFILE on a Master every 5 minutes, how does that replicate ?

Back on Jan 22, 2012, I wrote an answer to this post : MySql shell command not replicated to slave

In my answer, I explained how LOAD DATA LOCAL INFILE gets replicated:

Master
1. Executes LOAD DATA LOCAL INFILE
2. Copies the contents of entire text file used into the binary logs
3. The SQL command LOAD DATA LOCAL INFILE is appended to the latest binary log
Replication ships all of this from the Master's Binary Logs to the Slave's Relay Logs
Slave
1. Sees the Text File in the Relay Logs
2. Reads all the blocks from multiple Relay Logs
3. The text file get materialized in the /tmp folder
4. Reads LOAD DATA LOCAL INFILE command from Relay Log
5. Executes LOAD DATA LOCAL INFILE in the SQL Thread

During Steps 1-4 on the Slave, the IO Thread would have to be in the Reading event from the relay log state, constantly extracting the CSV needed for the next LOAD DATA LOCAL INFILE command. Sometimes, this causes Seconds_Behind_Master to leap hundreds or even thousands of seconds at a time without warning after staying relatively idle.

Running STOP SLAVE; does not help at all. The reason STOP SLAVE; hangs? Steps 1-4 of the Slave will lock the IO Thread until the complete extraction of the CSV file. Even when the extraction has completed, there is the LOAD DATA LOCAL INFILE itself. Running STOP SLAVE; will lock on the SQL Thread running LOAD DATA LOCAL INFILE.

In this crazy paradigm, Replication Lag has to increase steadily. Just look at your Relay_Log_Space. It is 5708184440 (like 5.3 GB). There are multiple LOAD DATA LOCAL INFILE commands just waiting to execute.

Look at the how Seconds_Behind Master increases

LOAD DATA LOCAL INFILE is executed on the Master
The CSV needed in loaded into Binary Logs
The CSV needed in unloaded from Relay Logs
LOAD DATA LOCAL INFILE is executed on the Slave

If LOAD DATA LOCAL INFILE takes 2 minutes, double that number and add the taken to ship the CSV file through the MySQL Replication process. You may need to come up with a different method of loading data that does not use LOAD DATA LOCAL INFILE.

UPDATE 2013-04-05 15:00 EDT

If MySQL Replication continues falling behind (Seconds_Behind_Master keeps increasing) while the Master keeps getting log-jammed with small LOAD DATA LOCAL INFILE commands, there is only one more thing I could suggest: Most people never touch sync_binlog, which is normally zero. What effect can this have on replicating LOAD DATA LOCAL INFILE ?

According to the Documentation on sync_binlog, this setting can be used to flush binlog changes to disk. Since it is 0 by default, your Master DB Server is at the mercy of the OS because OS is responsible for flusihng binlog changes. When you set sync_binlog to 1, everything may actually change for better. How?

Here is what is probably happening when sync_binlog is 0 on a Master:

You run LOAD DATA LOCAL INFILE
mysqld on Master writes the command to the binlog
mysqld on Master writes the entire CSV file into the binlogs
mysqld on Master leaves it to the OS to flush binlogs changes
Slave reads all binlogs info from Master except the last binlog that the Master's OS did not flush
Slave status shows it is trying to retrieve the remaining info

This is how sync_binlog can hopefully improve things:

You run SET GLOBAL sync_binlog = 1;
You run LOAD DATA LOCAL INFILE
You run SET GLOBAL sync_binlog = 0;
mysqld on Master writes the command to the binlog
mysqld on Master writes the entire CSV file into the binlogs
mysqld on Master flushes every write to the binlogs because sync_binlog = 1
Slave reads all binlogs info from Master
Slave status should show it has read every needed binlog

Give it a Try !!!

UPDATE 2013-04-09 11:23 EDT

If you have a low-to-moderate amount of writes (INSERT, UPDATE, DELETE, and ALTER TABLE) in the Master DB Server, leaving sync_binlog at 1 may not be a bad idea. You would then need to do the following:

STEP 01) On the Slave, run STOP SLAVE;

STEP 02) On the Master, add sync_binlog=1 to /etc/my.cnf:

[mysqld]
sync-binlog=1

STEP 03) Run one of the following on the Master:

SET GLOBAL sync_binlog = 1;

service mysql restart

STEP 04) On the Slave, run START SLAVE;

Give it a Try !!!

Best Answer

Related Solutions

Mysql – Looking for an efficient way to fix “Could not parse relay log event entry…” error

UPDATE 2012-09-14 16:38 EDT

UPDATE 2012-09-14 18:10 EDT

MySQL slave stuck in “Reading event from the relay log”

UPDATE 2013-04-05 15:00 EDT

UPDATE 2013-04-09 11:23 EDT

Related Question