MySQL stuck on “wsrep: initiating replication for write set (-1)”

galeraMySQLmysql-5.7percona-server

I ran into a situation with MySQL recently which I hadn't seen before. We have a Percona cluster with 3 nodes. The master stopped processing queries and the PHP FPM web application we host became unresponsive. When I checked SHOW PROCESSLIST, MySQL processes were stuck in state: wsrep: initiating replication for write set (-1)

This was the case on both the primary, where PHP was directing all queries to, as well as one of the two secondaries. Here's the output I saw from SHOW PROCESSLIST:

mysql> show processlist;
+-------+-------------+--------------------+-----------------+---------+-------+--------------------------------------------------------+--------------------------------------------------------------------------------------+-----------+---------------+
| Id    | User        | Host               | db              | Command | Time  | State                                                  | Info                                                                                 | Rows_sent | Rows_examined |
+-------+-------------+--------------------+-----------------+---------+-------+--------------------------------------------------------+--------------------------------------------------------------------------------------+-----------+---------------+
|     1 | system user |                    | NULL            | Sleep   |  2171 | wsrep: committing write set (542480920)                | NULL                                                                                 |         0 |             0 |
|     2 | system user |                    | NULL            | Sleep   | 17169 | wsrep: aborter idle                                    | NULL                                                                                 |         0 |             0 |
|     4 | system user |                    | NULL            | Sleep   |  3250 | wsrep: deleting row for write-set (542480919)          | NULL                                                                                 |         0 |             0 |
| 46944 | $user1      | 172.24.62.92:54004 | $user1_db1      | Query   |  2158 | wsrep: initiating pre-commit for write set (542481004) | delete from $table where $col < '$val'                                               |         0 |             1 |
| 47126 | $user1      | 172.24.62.92:54745 | $user1_db2      | Query   |  2096 | wsrep: initiating replication for write set (-1)       | update $table2 set $col = current_timestamp where $col2 = 393 and $col3 = 176935     |         0 |             1 |
| 47155 | $user1      | 172.24.62.92:54841 | $user1_db3      | Query   |  2089 | wsrep: initiating replication for write set (-1)       | UPDATE $table SET $col5 = 'something' WHERE $somecol = '$someval'                    |         0 |             1 |
| 47416 | $user1      | 172.24.62.92:55891 | $user1_db3      | Query   |  1950 | wsrep: initiating replication for write set (-1)       | UPDATE $table SET $col5 = 'something' WHERE $somecol = 's'                           |         0 |             1 |
| 47576 | $user1      | 172.24.62.92:56493 | $user1_db3      | Query   |  1849 | wsrep: initiating replication for write set (-1)       | INSERT INTO $table3  ($column6, $column7, $column8, $column9, $column10 ...          |         0 |             0 |
| 47654 | $user1      | 172.24.62.92:56808 | $user1_db2      | Query   |  1924 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 43625 a |         0 |             1 |
| 48036 | $user1      | 172.24.62.92:58343 | $user1_db2      | Query   |  1795 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 248528  |         0 |             1 |
| 48936 | $user1      | 172.24.62.92:61929 | $user1_db2      | Query   |  1495 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 156001  |         0 |             1 |
| 48952 | $user1      | 172.24.62.92:61982 | $user1_db2      | Query   |  1490 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 205495  |         0 |             1 |
| 49497 | $user1      | 172.24.62.92:64167 | $user1_db2      | Query   |  1306 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 234457  |         0 |             1 |
| 49510 | $user1      | 172.24.62.92:64218 | $user1_db2      | Query   |  1302 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 209489  |         0 |             1 |
| 49839 | $user1      | 172.24.62.92:65534 | $user1_db2      | Query   |  1192 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 70958 a |         0 |             1 |
| 49970 | $user1      | 172.24.62.92:1539  | $user1_db2      | Query   |  1096 | wsrep: initiating replication for write set (-1)       | update $table set $col11 = $col11 + 1 where id = $val                                |         0 |             1 |
| 50292 | $user1      | 172.24.62.92:2819  | $user1_db2      | Query   |  1041 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 193078  |         0 |             1 |
| 50398 | $user1      | 172.24.62.92:3240  | $user1_db2      | Query   |  1006 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 242842  |         0 |             1 |
| 51120 | $user1      | 172.24.62.92:6135  | $user1_db2      | Query   |   763 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 173382  |         0 |             1 |
| 51453 | $user1      | 172.24.62.92:7456  | $user1_db1      | Query   |   653 | wsrep: initiating replication for write set (-1)       | delete from $table5 where expiry < 1496379436                                        |         0 |             2 |
| 51460 | $user1      | 172.24.62.92:7475  | $user1_db1      | Query   |   651 | wsrep: initiating replication for write set (-1)       | insert into $table5 values ('...                                                     |         0 |             0 |
| 51504 | $user1      | 172.24.62.92:7646  | $user1_db1      | Query   |   587 | wsrep: initiating replication for write set (-1)       | insert into $table5 values ('...                                                     |         0 |             0 |
| 51525 | $user1      | 172.24.62.92:7721  | $user1_db1      | Query   |   631 | wsrep: initiating replication for write set (-1)       | insert into $table5 values ('...                                                     |         0 |             0 |
| 51998 | $user1      | 172.24.62.92:9585  | $user1_db2      | Query   |   475 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 203223  |         0 |             1 |
| 52290 | $user1      | 172.24.62.92:10759 | $user1_db2      | Query   |   377 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 185874  |         0 |             1 |
| 53055 | $user1      | 172.24.62.92:13797 | $user1_db2      | Query   |   123 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 89879 a |         0 |             1 |
| 53303 | $user1      | 172.24.62.92:14793 | $user1_db2      | Query   |    39 | wsrep: initiating replication for write set (-1)       | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 146551  |         0 |             1 |
| 53396 | $user1      | 172.24.62.92:15176 | $user1_db2      | Query   |     7 | updating                                               | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 146551  |         0 |             0 |
| 53403 | $user1      | 172.24.62.92:15205 | $user1_db2      | Query   |     5 | updating                                               | update $table4 set $col11 = current_timestamp where $col12 = 393 and $col1 = 146551  |         0 |             0 |
| 53410 | root        | localhost          | NULL            | Query   |     0 | starting                                               | show processlist                                                                     |         0 |             0 |
+-------+-------------+--------------------+-----------------+---------+-------+--------------------------------------------------------+--------------------------------------------------------------------------------------+-----------+---------------+
30 rows in set (0.00 sec)

(Table and column names changed to protect the innocent)

I was unable to figure out how to recover from this situation, and we ended up having to restart the entire cluster (rolling reboot) and re-bootstratp the cluster using service mysql bootstrap-pxc

We had just recently upgraded to Percona 5.7… We had also changed Set Transactions via SET GLOBAL tx_isolation='READ-COMMITTED'; as referenced by MySQL Docs in order to try and improve performance of SELECTs. We're unsure if either of these might have caused the situation we encountered.

What does it mean if MySQL gets stuck in state wsrep: initiating replication for write set (-1) and what are some possible causes (and fixes) for that?

Best Answer

There seems like total crashes like this may have been due to some bug, probably introduced in 5.7.17 and probably fixed in 5.7.20, ref https://jira.percona.com/browse/PXC-877

I have noticed "blip"s in our systems with lots of queries hanging with the same status message, but the problem resolved itself within a minute or two, so probably not the same bug. It seems this happens with one (or more) of the masters if another master is doing some big write operation. I did indeed take down the whole cluster once to try to clear the backlog - this was a mistake, caused quite some downtime, the more proper fix would have been to redirect the mysql traffic to the one server that wasn't clogged and wait for the others to calm down by themselves.

$ mysqld --version mysqld Ver 5.7.23-23-57 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel23, Revision f5578f0, WSREP version 31.31, wsrep_31.31)