Mysql – pt-table-sync error: Called not_in_left in state 0

mysql, percona-tools

I have set up MySQL replication between two servers, using Percona XtraBackup:

Master is a MySQL 5.0.91 Community Edition (CentOS 4.8)

Slave is a MySQL 5.1.68 Community Edition (CentOS 6.4)

When starting the slave, some replicated queries were blocked because they referenced unknown "temp" tables.
I used a few SQL_SLAVE_SKIP_COUNTER commands to skip past the problem. Now that replication is up to date, I am trying to resync the tables.
Two tables are out of sync, and I am using pt-table-sync to resync them.

The first table was resynced without any problem (a few UPDATEs to replay).

But the second table, a huge one (57GB), gives me this error after some time (anywhere from a few minutes to a few hours):

pt-table-sync --verbose --execute -uroot -p h=10.2.0.1,D=MYDB,t=MyTable h=10.2.0.2

# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
Called not_in_left in state 0 at /usr/bin/pt-table-sync line 5500.  while doing MYDB.MyTable on 10.2.0.2
#      0       0      0      0 0         16:38:24 19:08:17 1    MYDB.MyTable

Note that I launch pt-table-sync from a third server on the local network.

I can't find much information about this error.
What would you recommend to help me solve this problem?

Best Answer

You could try:

Checking the MySQL error log after you attempt the sync. It may reveal consistency issues with the table that you weren't aware of.
Working around this table for now (using --ignore-tables), and then coming back to it later.
Trying different combinations of checksum options (--algorithms, --chunk-size) for the problem table. This might fix the problem, or shed more light on the underlying issue.

I received this error recently, and the underlying issue turned out to be a corrupt table in need of repair on the master server.

I discovered the problem by switching pt-table-sync to the "nibble" algorithm for my problem table (--algorithms nibble). Instead of the cryptic error, I got a specific error along with the SQL statement that caused it, which I then attempted to execute directly on the server. This led me straight to my problem.

I could have also checked the error log and discovered the same thing.
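
The answer does not say which statement exposed the corruption. As a minimal sketch, assuming the standard CHECK TABLE statement and the database and table names from the question, a small helper (check_table_sql is a hypothetical name) can print the statement for review before it is piped to the mysql client:

```shell
#!/bin/sh
# Hypothetical helper: print a CHECK TABLE statement for one table.
# Arguments: database name, table name. Nothing touches a server here.
check_table_sql() {
    printf 'CHECK TABLE `%s`.`%s`;\n' "$1" "$2"
}

# Print the statement for the table from the question.
check_table_sql MYDB MyTable

# To actually run it on the master (assumed to be 10.2.0.1, as in the question):
#   check_table_sql MYDB MyTable | mysql -h 10.2.0.1 -u root -p
```

If the check reports a problem on a MyISAM table, REPAIR TABLE is the usual next step. Other storage engines may need a different approach, such as a dump and reload.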

Alternatively, you could try changing algorithms or reducing the chunk size from 1,000 (e.g. --chunk-size 100). This may require a bit of trial and error, and it's likely to slow down the checksums significantly, which is why I suggest skipping your largest tables to start with.
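
As a sketch of that trial and error, the helper below (build_sync_cmd is a hypothetical name) prints a pt-table-sync command line for a given algorithm and chunk size. It assumes the connection arguments copied from the question and swaps --execute for --print, so that pt-table-sync would only show the statements it wants to run rather than apply them:

```shell
#!/bin/sh
# Hypothetical helper: build a pt-table-sync command line for review.
# Arguments: algorithm name (e.g. Nibble), chunk size (e.g. 100).
# Hosts, database and table are the ones from the question (assumed).
build_sync_cmd() {
    algo="$1"
    chunk="$2"
    printf 'pt-table-sync --verbose --print --algorithms %s --chunk-size %s -uroot -p h=10.2.0.1,D=MYDB,t=MyTable h=10.2.0.2\n' \
        "$algo" "$chunk"
}

# Show the command for the Nibble algorithm with a smaller chunk size.
build_sync_cmd Nibble 100
```

Once the printed statements look reasonable, the same command can be rerun with --execute in place of --print.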

For a sufficiently large table (and 57GB certainly qualifies) with a poorly distributed primary key, you might also experience problems with query or wait timeouts depending on your MySQL configuration - I'm not sure if this would also result in the same error. The pt-table-sync documentation offers a little bit more information on this.

DELETE ALL BINARY LOGS

If the DB server is not used as a MySQL replication master, proceed according to whether binary logging is enabled:

BINARY LOGGING IS ENABLED

mysql> RESET MASTER;

That will erase all binary logs and start with the first one (like mysql-bin.000001)

BINARY LOGGING IS NOT ENABLED

Simply go to the OS and delete the leftover binary log files (rm on Linux, del on Windows).

PURGE BINARY LOGS

If you want to keep binary logs from the last 48 hours, you run

mysql> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 48 HOUR;

mysqld will erase all binary logs older than 48 hours.
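
For scripted use, the same statement can be generated for any retention window. This is a minimal sketch: purge_binlogs_sql is a hypothetical helper name, and it only prints the SQL, so piping the output to the mysql client is left as an explicit step:

```shell
#!/bin/sh
# Hypothetical helper: print a PURGE BINARY LOGS statement that keeps
# the last N hours of binary logs. Rejects anything but a whole number.
purge_binlogs_sql() {
    hours="$1"
    case "$hours" in
        ''|*[!0-9]*)
            echo "hours must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'PURGE BINARY LOGS BEFORE NOW() - INTERVAL %s HOUR;\n' "$hours"
}

# Print the statement for a 48-hour retention window.
purge_binlogs_sql 48

# To run it against a server (connection options are assumptions):
#   purge_binlogs_sql 48 | mysql -u root -p
```

On a replication master, check that every slave has already read the logs about to be purged before running the statement.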

AUTOMATIC PURGING

Simply set the following in my.cnf (or my.ini)

[mysqld]
expire_logs_days=2

Then, either restart mysqld or run the following as root

mysql> SET GLOBAL expire_logs_days = 2;

That way, every manual or automatic flush of binary logs will delete logs older than 2 days.

Once expire_logs_days is set, mysqldump will clean up old logs every time it runs with --flush-logs.

Mysql – Strange MySQL replication error 1146 (Table doesn’t exist)

I'm still not sure why, but this problem disappeared after some of the troubleshooting steps I took to hunt down the culprit. Maybe the open files limit caused the errors while opening the MyISAM tables, perhaps because of the high activity of other services running on the slave2 server, such as Rackspace's cloud backup. Whatever the cause, replication on slave2 has been running smoothly off my master server for a number of weeks, after I loaded it once more with a fresh data snapshot taken from slave1 in the same fashion as described in my original question.

So, although unfortunately I can't provide a clear answer to the problem, I can definitely say it looks to have been solved. So I'm closing this post for the time being.