MySQL replication duplicate key errors after error 1594

MySQLreplication

So I had this happen to me recently:

mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 198.101.212.103
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000103
          Read_Master_Log_Pos: 30051137
               Relay_Log_File: mysqld-relay-bin.000095
                Relay_Log_Pos: 23511387
        Relay_Master_Log_File: mysql-bin.000095
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1594
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 23511241
              Relay_Log_Space: 477287485
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1

So I looked at master:

mysql> show master status;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000103 | 30007591 | xxx_l       |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

Everything fine. So I ran:

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> reset slave;
Query OK, 0 rows affected (0.09 sec)

mysql> change master to master_log_file='mysql-bin.000095', master_log_pos=23511241;
Query OK, 0 rows affected (0.02 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

As suggested by everyone and now I run into:

mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 198.101.212.103
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000103
          Read_Master_Log_Pos: 30763976
               Relay_Log_File: mysqld-relay-bin.000003
                Relay_Log_Pos: 58516
        Relay_Master_Log_File: mysql-bin.000091
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '128776' for key 'PRIMARY'' on query. Default database: 'xxx_l'. Query: 'INSERT INTO images ...'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 58370
              Relay_Log_Space: 684452249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '128776' for key 'PRIMARY'' on query. Default database: 'xxx_l'. Query: 'INSERT INTO images ...'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
1 row in set (0.00 sec)

Obviously my replication is now broken beyound repair. Considering every answer says that it is safe to replicate from the last master position why did this not work?

As a side note: if I do not have to reimage let me know.

Best Answer

How I would do it, without having to reload all data:

  1. Take the slave out of production- it has drifted and has wrong data
  2. Skip replication errors until the replication is running again (but remember, with wrong data) with slave-skip-errors. Not always possible.
  3. Use pt-table-checksum to identify the master-slave differences
  4. Use pt-table-sync on the particular table(s) to fix the errors. Both of those tools are harder to use when using filtering.
  5. You now have a consistent replication (for a while - you should identify the underlying problem that cause it at first)

Considering every answer says that it is safe to replicate from the last master position why did this not work?

This is not a sane MySQL advice. Full stop.

I cannot say for sure, but if that when you have filtering, my experience says that that is the 90% of the root cause for replication problems. There are many reasons to not use master-side filtering while having the same results (replication breaking, difficult to synchronize them, impossible to promote a slave as master, etc.). Other options include the loss or corruption of the relay logs.