MySQL: InnoDB keeps crashing – how to recover

Tags: innodb, MySQL, recovery, replication

# free -m
             total       used       free     shared    buffers     cached
Mem:         48289      35288      13000          0        347      30399
-/+ buffers/cache:       4541      43747
Swap:         8189         51       8137

MySQL cannot start; the errors from /var/log/mysqld.log are here: http://fpaste.org/4VMB/

It can only be started after adding innodb_force_recovery = 1 to my.cnf, but then I get another error when starting the server: http://fpaste.org/6azJ/
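
For reference, the my.cnf change is just one line under [mysqld] (a minimal sketch):

[mysqld]
# Forces InnoDB to start even when crash recovery hits corruption.
# 1 is the mildest level; higher values (up to 6) disable more of the
# engine and should only be used to dump data out, never for normal use.
innodb_force_recovery = 1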

This server used to be the master, but I was able to promote a slave to be the new master. Currently I am trying to set this failed master up as a new slave, but I can't get it to start.
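
Once the server actually starts, pointing it at the new master would be the standard CHANGE MASTER TO; everything below (host, user, password, coordinates) is a placeholder and would have to come from the new master's SHOW MASTER STATUS or the backup:

mysql> CHANGE MASTER TO
    ->     MASTER_HOST = 'new-master.example.com',
    ->     MASTER_USER = 'repl',
    ->     MASTER_PASSWORD = '...',
    ->     MASTER_LOG_FILE = 'mysql-bin.000001',
    ->     MASTER_LOG_POS = 4;
mysql> START SLAVE;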

What should I do now?


UPDATE Thu Jul 19 23:50:17 ICT 2012:

It started successfully with innodb_force_recovery=2, but the MySQL server goes away when I run a DROP TABLE:

mysql> drop table reportingdb.bigdata_banner_scheduler;
ERROR 2013 (HY000): Lost connection to MySQL server during query

Here are the logs: http://fpaste.org/M82a


UPDATE Fri Jul 20 08:02:57 ICT 2012:

I've been trying to rebuild replication using Percona XtraBackup. At first I hit this bug when copying with innobackupex; thanks to @DTest, who suggested increasing innodb_log_file_size to 1 GB, which fixed it.
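
For anyone following along: on 5.5, resizing the redo log is not just a my.cnf edit. Roughly (a sketch only; paths assume the default /var/lib/mysql datadir):

# Stop the server cleanly so the redo log is fully checkpointed
service mysqld stop

# In my.cnf, under [mysqld], per @DTest's suggestion:
#   innodb_log_file_size = 1G

# On 5.5 the old redo logs must be moved aside (keep them, just in case)
mv /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile0.bak
mv /var/lib/mysql/ib_logfile1 /var/lib/mysql/ib_logfile1.bak

# InnoDB recreates the log files at the new size on startup
service mysqld start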

Please note: you should copy the innodb_* settings from the master to the slave and run innobackupex --apply-log /path/to/datadir on the slave, otherwise you will get the errors below (the full innobackupex flow is sketched after the log):

120720  6:18:50  InnoDB: Error: page 3670052 log sequence number 8078993744933
InnoDB: is in the future! Current system log sequence number 8078561559052.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: for more information.
InnoDB: Error: trying to access page number 2175909760 in space 0,
InnoDB: space name ./ibdata1,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server.
120720  6:18:50  InnoDB: Assertion failure in thread 47633462918272 in file fil0fil.c line 4434
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
23:18:50 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
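
For reference, the innobackupex flow I'm describing looks roughly like this (the backup directory name and credentials are placeholders):

# On the master: take a full binary backup
innobackupex --user=root --password=... /backups/

# On the slave, after transferring the backup directory:
# apply the redo log so the datafiles and ib_logfile* are consistent with each other
innobackupex --apply-log /backups/2012-07-20_06-00-00/

# Copy the prepared backup into an empty datadir and fix ownership
innobackupex --copy-back /backups/2012-07-20_06-00-00/
chown -R mysql:mysql /var/lib/mysql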

But the game is NOT over: the slave keeps crashing after a few minutes:

120720  7:58:28 [Warning] Slave SQL: Could not execute Write_rows event on table reportingdb.ox_banners; Duplicate entry '145928' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000999, end_log_pos 337836040, Error_code: 1062
120720  7:58:28 [Warning] Slave SQL: Could not execute Write_rows event on table reportingdb.selfserving_img_signatures; Duplicate entry '145928' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000999, end_log_pos 337843612, Error_code: 1062
120720  7:58:28 [Warning] Slave SQL: Could not execute Write_rows event on table reportingdb.selfserving_email_log; Duplicate entry '173213' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000999, end_log_pos 337844062, Error_code: 1062
00:58:29 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=1048576
read_buffer_size=1048576
max_used_connections=4
max_threads=2000
thread_count=2
connection_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4119820 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x11cc5f20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 40c73a78 thread_stack 0x40000
/usr/libexec/mysqld(my_print_stacktrace+0x2e)[0x7af52e]
/usr/libexec/mysqld(handle_fatal_signal+0x3e2)[0x67c242]
/lib64/libpthread.so.0[0x3fed00ebe0]
/usr/libexec/mysqld(_ZN13st_select_lex17mark_as_dependentEPS_+0x4d)[0x568a3d]
/usr/libexec/mysqld[0x68cc02]
/usr/libexec/mysqld(_ZN10Item_field15fix_outer_fieldEP3THDPP5FieldPP4Item+0x670)[0x690c90]
/usr/libexec/mysqld(_ZN10Item_field10fix_fieldsEP3THDPP4Item+0x351)[0x691361]
/usr/libexec/mysqld(_ZN9Item_func10fix_fieldsEP3THDPP4Item+0x1d3)[0x6cb433]
/usr/libexec/mysqld(_Z11setup_condsP3THDP10TABLE_LISTS2_PP4Item+0x1a5)[0x53aae5]
/usr/libexec/mysqld(_Z20mysql_prepare_updateP3THDP10TABLE_LISTPP4ItemjP8st_order+0x118)[0x5df3e8]
/usr/libexec/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_ordery15enum_duplicatesbPySB_+0x2b4)[0x5e0134]
/usr/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x239b)[0x575c5b]
/usr/libexec/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x10a)[0x57994a]
/usr/libexec/mysqld(_ZN15Query_log_event14do_apply_eventEPK14Relay_log_infoPKcj+0xc57)[0x734757]
/usr/libexec/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14Relay_log_info+0x16e)[0x516fce]
/usr/libexec/mysqld[0x51e631]
/usr/libexec/mysqld(handle_slave_sql+0xc46)[0x51f946]
/lib64/libpthread.so.0[0x3fed00677d]
/lib64/libc.so.6(clone+0x6d)[0x3fec8d325d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (128380d7): UPDATE `ox_banners` A
        SET A.locationAd=@locCP 
        WHERE A.zoneId = NAME_CONST('_zoneid',2452)
Connection ID (thread ID): 2061
Status: NOT_KILLED

slave-skip-errors = 1062 does not seem to help.
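
For completeness, here is how I have it set, plus the one-event-at-a-time alternative (sql_slave_skip_counter only skips a single event per invocation, so it is no cure for a stream of duplicates):

# my.cnf, [mysqld] section -- needs a server restart to take effect
slave-skip-errors = 1062

-- Alternative from the mysql client: skip one offending event, then resume
mysql> STOP SLAVE;
mysql> SET GLOBAL sql_slave_skip_counter = 1;
mysql> START SLAVE;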

I'm going to take a snapshot of the master with mysqldump and hope that solves the crashing problem.
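
The dump I have in mind, assuming all the tables involved are InnoDB so --single-transaction gives a consistent snapshot without locking the master:

# On the new master: consistent dump plus binlog coordinates (written as a
# comment in the dump header because of --master-data=2)
mysqldump -u root -p --single-transaction --master-data=2 --all-databases > /backups/master_snapshot.sql

# On the slave: import, then point CHANGE MASTER TO at the coordinates
# recorded in the header of the dump
mysql -u root -p < /backups/master_snapshot.sql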

Best Answer

From a chat discussion, the first error is because the file ./reportingdb/bigdata_banner_scheduler.ibd is missing. Simply copying this file from the master won't work, though; you'll need to drop the table on the slave and then dump the table from the master.
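
A quick way to confirm that on the slave (a minimal check, assuming the default datadir of /var/lib/mysql; adjust the path to your installation):

# The .frm (table definition) usually survives; the .ibd (data file) is what's gone
ls -l /var/lib/mysql/reportingdb/bigdata_banner_scheduler.{frm,ibd}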

But that assertion failure is a different matter. You can start in force_recovery mode 1, but something is killing the mysqld process, and it doesn't appear to be a memory misconfiguration.

Since you are trying to set this up as a slave of the recently promoted master, I would actually wipe the data, re-install MySQL, and start from a fresh copy of the master.

If for whatever reason you'd like to try to get it working without dumping the entire master (not recommended), my steps would be these (a rough command sketch follows the list):

  • Put skip-slave-start in the my.cnf to disable the slave from starting automatically
  • Take out innodb-force-recovery from the my.cnf
  • Copy all the files from the datadir to a separate location on your slave server
  • Re-install MySQL (the exact steps depend on your OS)
  • Copy the mysql and performance_schema directories from the old install back into the freshly installed datadir.
  • Start the MySQL server to make sure it comes up normally and without issue.
  • If it does, stop the server again and continue these steps
  • Set innodb-force-recovery to 1 in the my.cnf
  • Copy all the other files from your backup back into the datadir
  • Start the server. This should put you in a state where you can DROP the table whose ./reportingdb/bigdata_banner_scheduler.ibd file is missing.
  • DROP TABLE reportingdb.bigdata_banner_scheduler
  • Stop the server
  • Remove the innodb-force-recovery from my.cnf
  • Start the server.
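
Roughly, in commands (a sketch only: datadir, service and package names assume a stock RHEL/CentOS layout, which the /usr/libexec/mysqld path in the backtrace suggests; adjust for your system):

# 1. Preserve the old data and reinstall the server
service mysqld stop
mv /var/lib/mysql /var/lib/mysql.old
yum remove mysql-server && yum install mysql-server   # reinstall step varies by OS/repo

# 2. Bring the system schemas back from the old datadir
mkdir -p /var/lib/mysql
cp -a /var/lib/mysql.old/mysql /var/lib/mysql/
cp -a /var/lib/mysql.old/performance_schema /var/lib/mysql/
chown -R mysql:mysql /var/lib/mysql

# 3. Verify a clean start, then stop again
service mysqld start
service mysqld stop

# 4. With skip-slave-start and innodb-force-recovery=1 in my.cnf,
#    copy everything else back (ibdata1, ib_logfile*, the database dirs)
rsync -a --exclude=mysql/ --exclude=performance_schema/ /var/lib/mysql.old/ /var/lib/mysql/
chown -R mysql:mysql /var/lib/mysql
service mysqld start
mysql -u root -p -e "DROP TABLE reportingdb.bigdata_banner_scheduler"

# 5. Remove innodb-force-recovery from my.cnf and restart normally
service mysqld stop && service mysqld start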

At this point, if all went well, you should have a working 'slave' server without the reportingdb.bigdata_banner_scheduler table, and replication should still be disabled (not reading from the master's binlog).

The steps I would do to get the table back on the slave are:

  • From the master, take a dump of the table structure and data: mysqldump -u.. -p reportingdb bigdata_banner_scheduler > reportingBigData.sql
  • Copy the dump to the slave
  • Import the dump back into the slave: mysql -u... -p reportingdb < reportingBigData.sql
  • Then START SLAVE so it catches up on the missing binlog events (see the sketch below)
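
The last step from the mysql client, with a quick sanity check that replication is actually flowing again:

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G
-- Look for Slave_IO_Running: Yes, Slave_SQL_Running: Yes,
-- and Seconds_Behind_Master trending toward 0.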