MySQL – how to safely stop mysqldump

debian, linux, MySQL, mysql-5.5, mysqldump

I (well, my cron script) tried killall mysqldump, and that didn't end well: the mysql server stopped accepting connections after a while.

It was a Debian Jessie machine running MySQL 5.5.55-0+deb8u1.

Usage scenario was:

  • there was a long-running (a few hours) SELECT query which was either really that slow, or the client that sent it was having problems (the query state was Sending data); all the other queries were happily coming and going (only the load was perhaps a little higher).

  • at night, the backup was run with mysqldump --max_allowed_packet=2147483648 --hex-blob --single-transaction --master-data --routines --order-by-primary --databases db1 db2 db3... | pigz -p8 > backup.sql.gz. It never finished, probably because it was waiting for the SELECT above to finish first (guessing here – it was the only thing that looked out of the ordinary, and the same setup had worked fine for months).

  • a cron job ran in the morning with killall -q mysqldump, which was supposed to safely terminate the backup in case it had not finished by the set time (notifying the admin to examine and fix the problem later), thus allowing people to continue working with the mysql server normally.

  • the result, however, was a full connection table, and thus no user was able to log in to the mysql server. There was a FLUSH /*!40101 LOCAL */ TABLES query stuck in Waiting for table flush, and hundreds of SELECT queries stuck in the same Waiting for table flush state.

  • in addition, the admin killing the LOCK TABLES mysql query didn't help, as the other SELECT queries remained in Waiting for table flush (which seems to be intended behavior?)

Restarting the mysql server finally "fixed" the problem. However, wanting to avoid a repeat of this situation (and of emergency admin interventions), I'd like to be able to safely terminate a mysqldump backup on Debian Jessie mysql-5.5.55 (or the upcoming Debian Stretch mariadb-10.1.23-8). Is there a way?

If not, what are the other options for accomplishing a mysql backup while avoiding server load in the morning (which is, in this case, almost as bad as a completely hung server)?

(I'd like to stay with Debian Stable packages if at all possible)

Best Answer

Since you are using --master-data to take a consistent snapshot of the master status, the internals of mysqldump issue the following commands to the mysql server:

2017-05-31T04:39:05.843130Z    48 Query /*!40100 SET @@SQL_MODE='' */
2017-05-31T04:39:05.843273Z    48 Query /*!40103 SET TIME_ZONE='+00:00' */
2017-05-31T04:39:05.843411Z    48 Query FLUSH /*!40101 LOCAL */ TABLES
2017-05-31T04:39:05.846031Z    48 Query FLUSH TABLES WITH READ LOCK
2017-05-31T04:39:05.846166Z    48 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
2017-05-31T04:39:05.846279Z    48 Query START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */
2017-05-31T04:39:05.846413Z    48 Query SHOW MASTER STATUS
2017-05-31T04:39:05.846539Z    48 Query UNLOCK TABLES
...
(here mysqldump continues dumping the data and structures)

What happened?

Your backup had just started, but a query had already been running against a particular table for a long period, since before the FLUSH TABLES command was even issued, and it had not released that table. FLUSH TABLES therefore had to wait for that thread to complete (it keeps trying to flush until that table's version is the same as all the other tables').

Thus you get other threads blocked on the other tables as well, because the lock applies at the level of every database and every table while the flush is in progress. Eventually every new connection accumulated in the processlist and piled up until max_connections was reached, at which point nobody could log in.
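At that point the processlist shows the whole picture. A minimal diagnostic sketch, assuming MySQL 5.1+ (which provides information_schema.processlist); the 600-second threshold is my assumption, tune it to your workload:

SELECT id, user, time, state, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE state = 'Waiting for table flush'
   OR time > 600
ORDER BY time DESC;

The thread with the largest time that is not itself waiting for the flush is usually the long SELECT holding everything up.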

Let's say you had managed to log in to a terminal and tried to kill the FLUSH TABLES thread: I don't think there is a way to pull back or roll back the flushing already done and release that thread's connection, so it might sit in a killed state for a long time. That leaves you with the last option, which is to restart the server.

How to fix it?

At the time of the issue, when the admin managed to log in to the mysql prompt: instead of issuing a kill on the FLUSH TABLES thread, the kill should have been issued against the thread running the long SELECT. Chances are the SELECT would have been dropped, leaving the table free for FLUSH TABLES to acquire, update its version, and release the lock for new queries, and the backup would have continued. I don't think anyone was still waiting on the other end for the result of a query that had been running for long hours anyway.
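A sketch of that intervention, assuming the long SELECT is the thread with the largest time in the processlist (the id 1234 below is a placeholder for whatever the first query returns):

SELECT id, user, time, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE command = 'Query' AND info LIKE 'SELECT%'
ORDER BY time DESC
LIMIT 5;

KILL QUERY 1234;  -- aborts the statement but keeps the client connection open

KILL QUERY (rather than plain KILL) terminates only the running statement, which is usually enough to unblock the flush.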

What is the long-term solution?

  • You have to ensure that no such long-running queries are active at the time of the backup.

  • It looks like this could have been a new deployment, or someone triggered a bad query and didn't bother to close the session.

  • Kill any query that has been running for more than X seconds (the threshold depends on your requirements); see the sketch after this list. Or:

  • Work out an arrangement with the teams to tune their queries, or have them send the query to the admin, who can give them the result data, or give them a separate slave altogether.
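For the kill-after-X-seconds option, a minimal sketch that could run from cron, in the spirit of the one-liners already used above (the 3600-second threshold and the restriction to SELECT statements are assumptions to be tuned; credentials are assumed to come from ~/.my.cnf; pt-kill from Percona Toolkit does the same job with more safeguards):

mysql -N -B -e "SELECT CONCAT('KILL QUERY ', id, ';')
  FROM information_schema.processlist
  WHERE command = 'Query' AND info LIKE 'SELECT%' AND time > 3600;" | mysql

The first mysql invocation prints one KILL QUERY statement per offending thread; piping its output into a second mysql executes them.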