MySQL RDS Read Replica – How to Fix Replication Error

MySQL

I use amazon rds and have a cross region replica. While I was doing some schema changes my read replica gave me this error and replication stopped.

Message : Read Replica Replication Error – SQLError: 1146, reason:
Error Table phppoint_koreannoona.phppos_items doesn't exist on query.
Default database: DB_NAME;. Query: ALTER TABLE phppos_items DROP
COLUMN aaatex_qb_item_name

What would cause this? The table is in the master. My migration DID remove that field but entire table was missing in slave.

I had to rebuild slave.

Best Answer

What would cause this?

Let's go into facts, then we will later have to go into speculation:

Table phppoint_koreannoona.phppos_items doesn't exist on query

You executed an ALTER TABLE query on a table that didn't exist. Your table existed on the master at the time the alter was ran (otherwise, it wouldn't have gone through the binlog and tried to be executed on the replica), but didn't exist on the replica. This makes replication break because it cannot continue in a consistent way.

What can make a table exist on the master and not on the replica? Several options:

The table was never there. In order to create a replica, RDS does a snapshot of the master and transfer it away. Depending on when or how it is done, the replica can be non-consistent (for example, I would be skeptical of a snapshot done at the same time than a running schema change). Maybe it was a memory only table and it wasn't properly transferred to disk? Maybe it was in a particular engine that has issues with transactionality (MyISAM).
The table was there, but it was deleted somehow.
Something else happened that caused the replicas to go out of sync. The exact reason will depend highly on the environement, configuration and the application. Some examples include:
- A crash of the replica in a non-crash safe configuration
- Replication coordinates were changed inappropriately or other maintenance introduced a mistake on replication
- Unsafe write statements or configuration is used, which leads to inconsistencies
- Filtering is used, leading to data inconsistencies
- Queries were run on the master with replication/binlog disabled

How to fix that?

The safest way to fix inconsistencies on a replica is to recreate it from scratch. You can copy a single table and synchronize it relatively easily if it has a low amount of writes, but that will not prevent you from having other data drift on other tables.

How to prevent that from happening again?

The most important part is to have checks in places to detect issues early. Tools such as pt-table-checksum or simple dumps and compares may allow you to detect differences earlier. It is impossible, with the information given, to say exactly what caused that, as mysql configuration will be safe or not depending on the specific setup. Some general advice:

Try using ROW based replication, it is usually more resilient against bad configuration/queries
Check for "unsafe statements for binlog" warnings on the logs
Check the logs for replication changes due to improper maintenance
Make sure replicas are really in read only and they cannot be written or manipulated out of band
Use innodb on all your tables, as well as crash-resilient configuration for your replicas

Related Solutions

MySQL row-based-replication HA_ERR_KEY_NOT_FOUND… which record is it

I think you should re-initialize your replication.

If the replication worked good, the rows should not be missing. That the rows are missing means there is some inconsistency between your master und your slave. I would not act on the assumption that this is the only error and if I'd fix it by hand everything is back fine again. Instead I would set up the replication from the start again and make sure it stays synced this time, to be sure to get rid of all inconsistencies.

Nevertheless there are some ways to identify which rows are missing. As a starting point you could take a look at this blogpost. This basically includes "decode the relay log" and "search for the current relay log position". One can decode the log for example with this statement:

mysqlbinlog mysql-relay-bin.000001 –base64-output=decode-rows -v > decoded.log

Mysql – Is it possible to recover from a LOST_EVENTS incident from an RDS read replica

In this specific case (a failed grant statement) stopping the slave, skipping the event, and restarting the slave is likely a safe operation... you just have to do it the RDS way.

mysql> CALL mysql.rds_skip_repl_error;

This is one of several stored procedures installed on every RDS instance that partially make up for the lack of SUPER privilege.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.MySQL.CommonDBATasks.html

Best Answer

Related Solutions

MySQL row-based-replication HA_ERR_KEY_NOT_FOUND… which record is it

Mysql – Is it possible to recover from a LOST_EVENTS incident from an RDS read replica

Related Question