Mariadb – how to remove/deal with locks in a MariaDB Galera cluster

blockinggaleramariadbmariadb-10.1

I have a cluster of 3 nodes running MariaDB Galera cluster, cluster works as expected, I can reboot a single node at a time and in the best case, IST will be used or SST(using rsync) to sync back the node, all good so far.

I noticed that some DELETES or DDL's as described here: http://galeracluster.com/documentation-webpages/schemaupgrades.html, can lock the full cluster, therefore when possible and having access to the 3 nodes, I started to follow this procedure:

  1. SET wsrep_OSU_method='RSU';
  2. Run the ALTER/DELETE statement.
  3. SET wsrep_OSU_method='TOI';

The problem with this approach is that I need to run the procedure in all the 3 nodes and in some cases this is not possible since the full cluster is exposed using an HAproxy.

The more the database grows and if "accidentally" someone runs a DELETE or ALTER instead of trying to follow the "RSU/TOI" procedure, the whole database locks/blocks (no read/writes) and start affecting other databases within the cluster. An ugly but quick solution after "guessing" the node where the query that is locking is by running this query:

show open tables where in_use > 0;

Rebooting the node in question helps, and hopefully, the node will resync using IST like mentioned before and worst case SST. Restarting the mysqld process works but takes to much time that is better to just reboot the full VM since is faster to have the cluster unlocked and usable again.

Is not possible also to kill the query:

ERROR 1095 (HY000): You are not owner of thread 70760372

This "workaround" has become a bad habit, therefore, wondering what other techniques or procedures can be used to unlock the cluster in a graceful, clean and pragmatically way (something that could be automated) or even better what options/settings could help to prevent against the locks?

A long term solution could be to use a Master-slave setup, in where the "RSU/TOI" could be avoided, but before going that further, I would like to fine tune better the current cluster setup.

Best Answer

The only way that I found to fix this issue is to restart the whole cluster in the original order as restarting a single node just fails. I'm not sure why this is a thing in Galera, it seems to be very sensitive to errors like this. In my opinion a database cluster should just work no matter what queries you throw at it.

I guess it's something of a free add-on when you go the Galera road, unfortunately.