MariaDB – How to Fix WSREP Failed to Recover Position Error on Ubuntu

galeramariadbmariadb-10.1MySQLUbuntu

I'm using 10.1.16-MariaDB-1~xenial from the official MariaDB apt repository for 10.1 [stable], via the University of Texas mirror.

I had a perfectly functioning MariaDB Galera cluster setup on 3 Ubuntu 16.04 servers.

Then I upgraded them. Now I have nothing.

The upgrade to 10.1.16 failed, and quickly brought down the whole cluster. I don't have the output, but dpkg failed on setting up mariadb-server and mariadb-server-10.1.

I have backups, so I purged all traces of MariaDB/MySQL/Galera from my servers (including removing /var/lib/mysql/, /etc/mysql/, and /var/log/mysql/) and started over. However, now, with a clean install on each server, none of the standard system startup scripts work. I suspect this is why the upgrade process through apt failed, too.

I've tried each of the following on my first node:

galera_new_cluster
service mysql bootstrap
service mysql bootstrap --wsrep-new-cluster
service mysql bootstrap --wsrep-cluster-address="gcomm://"
service mysql start
service mysql start --wsrep-new-cluster
service mysql start --wsrep-cluster-address="gcomm://"
systemctl start mariadb
systemctl start mariadb --wsrep-new-cluster
systemctl start mariadb --wsrep-cluster-address="gcomm://"

Every single one gives me the same output:

Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

systemctl status mariadb.service:

● mariadb.service - MariaDB database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: exit-code) since Fri 2016-07-22 13:29:45 CDT; 42s ago
  Process: 10799 ExecStartPre=/bin/sh -c VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] &&   systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
  Process: 10794 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 16865 (code=exited, status=0/SUCCESS)

Jul 22 13:29:41 sql2 systemd[1]: Starting MariaDB database server...
Jul 22 13:29:45 sql2 mysqld[10799]: WSREP: Failed to recover position: '2016-07-22 13:29:41 140110745778432 [Note] /usr/sbin/mysqld (mysqld 10.1.16-MariaDB-1~xenial) starting as process 11080 ...'
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Control process exited, code=exited status=1
Jul 22 13:29:45 sql2 systemd[1]: Failed to start MariaDB database server.
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Unit entered failed state.
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Failed with result 'exit-code'.

The only way I can start my servers now is by manually executing:

sudo -u mysql mysqld --wsrep-cluster-address='gcomm://'

On the first node, and then:

sudo -u mysql mysqld --wsrep-cluster-address='gcomm://ip1,ip2,ip3'

On the other two nodes. That works, and I have a working cluster again. But now, systemd/systemctl have no idea the service is running. It seems like the systemd startup scripts can't use the wsrep-cluster-address setting in my configuration files at all. Specifying it to service or systemctl command line does not work either.

How am I supposed to start mariadb?

Best Answer

There was a bug in galera_recovery.sh script. https://jira.mariadb.org/browse/MDEV-10396