Mariadb – Reboot VMs the mariadb do not restart

galeramariadb

I install the MariaDB 10.1 on three VMs(controller1, controller2, controller3), and after I cluster them, it works well, but after I restart the three VMs, the mariadb_cluster did not restart.

In normal, the mariadb should restart too.


I use CentOS 7.2.


EDIT

I then stop mariadb on every VM:

systemctl stop mariadb.service

and in VM1 I use below command to restart my mariadb:

/usr/sbin/mysqld --wsrep-new-cluster --user=root & 

And in VM2 and VM3 I use:

systemctl start mariadb.service

To start the mariadb, but in the VM1 there is so many logs below:

...
    968961af,0
} joined {
} left {
} partitioned {
})
2017-07-05  0:41:11 139662434756800 [Note] WSREP: save pc into disk
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr without UUID: tcp://10.1.1.120:4567
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr proto entry 0x7f0527379f00
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr without UUID: tcp://10.1.1.121:4567
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr proto entry 0x7f0527379fc0
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr without UUID: tcp://10.1.1.122:4567
2017-07-05  0:41:11 139662434756800 [Note] WSREP: discarding pending addr proto entry 0x7f052737a080
2017-07-05  0:41:11 139662434756800 [Note] WSREP: gcomm: connected
2017-07-05  0:41:11 139662434756800 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-07-05  0:41:11 139662434756800 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-07-05  0:41:11 139662434756800 [Note] WSREP: Opened channel 'openstack'
2017-07-05  0:41:11 139659854210816 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2017-07-05  0:41:11 139662434756800 [Note] WSREP: Waiting for SST to complete.
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Starting new group from scratch: 9696bc2b-60d7-11e7-b202-a640d673651f
2017-07-05  0:41:11 139659854210816 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 9696f337-60d7-11e7-9eea-a22b7657716b
2017-07-05  0:41:11 139659854210816 [Note] WSREP: STATE EXCHANGE: sent state msg: 9696f337-60d7-11e7-9eea-a22b7657716b
2017-07-05  0:41:11 139659854210816 [Note] WSREP: STATE EXCHANGE: got state msg: 9696f337-60d7-11e7-9eea-a22b7657716b from 0 (controller1)
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 0,
    members    = 1/1 (joined/total),
    act_id     = 0,
    last_appl. = -1,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 9696bc2b-60d7-11e7-b202-a640d673651f
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Flow-control interval: [16, 16]
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Restored state OPEN -> JOINED (0)
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Member 0.0 (controller1) synced with group.
2017-07-05  0:41:11 139659854210816 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
2017-07-05  0:41:11 139662434481920 [Note] WSREP: New cluster view: global state: 9696bc2b-60d7-11e7-b202-a640d673651f:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
2017-07-05  0:41:11 139662434756800 [Note] WSREP: SST complete, seqno: 0
2017-07-05 00:41:11 7f05b1cfe8c0 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: The InnoDB memory heap is disabled
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-07-05  0:41:11 139662434756800 [Note] InnoDB: Using Linux native AIO
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: Using SSE crc32 instructions
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: Initializing buffer pool, size = 2.0G
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: Completed initialization of buffer pool
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: Highest supported file format is Barracuda.
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: 128 rollback segment(s) are active.
2017-07-05  0:41:12 139662434756800 [Note] InnoDB: Waiting for purge to start
2017-07-05  0:41:12 139662434756800 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.32-79.0 started; log sequence number 1624788
2017-07-05  0:41:12 139657069168384 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-07-05  0:41:12 139662434756800 [Note] Plugin 'FEEDBACK' is disabled.
2017-07-05  0:41:12 139662434756800 [Note] Server socket created on IP: '10.1.1.120'.
2017-07-05  0:41:12 139662434756800 [Warning] 'proxies_priv' entry '@% root@controller1' ignored in --skip-name-resolve mode.
2017-07-05  0:41:12 139662434481920 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-07-05  0:41:12 139662434481920 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-07-05  0:41:12 139659916711680 [Note] WSREP: Service thread queue flushed.
2017-07-05  0:41:12 139662434481920 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2017-07-05  0:41:12 139659916711680 [Note] WSREP: Service thread queue flushed.
2017-07-05  0:41:12 139662434481920 [Note] WSREP: Synchronized with group, ready for connections
2017-07-05  0:41:12 139662434481920 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-07-05  0:41:12 139662434756800 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.1.19-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
2017-07-05  0:44:38 139659862603520 [Note] WSREP: (968961af, 'tcp://0.0.0.0:4567') connection established to fde5638c tcp://10.1.1.121:4567
2017-07-05  0:44:38 139659862603520 [Note] WSREP: (968961af, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2017-07-05  0:44:38 139659862603520 [Note] WSREP: (968961af, 'tcp://0.0.0.0:4567') connection established to 783fd9e3 tcp://10.1.1.122:4567
2017-07-05  0:44:38 139659862603520 [Note] WSREP: declaring 783fd9e3 at tcp://10.1.1.122:4567 stable
2017-07-05  0:44:38 139659862603520 [Note] WSREP: declaring fde5638c at tcp://10.1.1.121:4567 stable
2017-07-05  0:44:38 139659862603520 [Warning] WSREP: 968961af conflicting prims: my prim: view_id(PRIM,968961af,1) other prim: view_id(PRIM,783fd9e3,1)
2017-07-05  0:44:38 139659862603520 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=968961af,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=0,instances=
    968961af,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,968961af,1),to_seq=2,weight=1,segment=0
,state_msgs=
,current_view=view(view_id(REG,783fd9e3,2) memb {
    783fd9e3,0
    968961af,0
    fde5638c,0
} joined {
    783fd9e3,0
    fde5638c,0
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,968961af,1) memb {
    968961af,0
} joined {
} left {
} partitioned {
}),mtu=32636}
2017-07-05  0:44:38 139659862603520 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=783fd9e3,srcvid=view_id(REG,783fd9e3,2),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=239,nl=(
)
} 64
2017-07-05  0:44:38 139659862603520 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=fde5638c,srcvid=view_id(REG,783fd9e3,2),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=8,nl=(
)
}
 state after handling message: evs::proto(evs::proto(968961af, OPERATIONAL, view_id(REG,783fd9e3,2)), OPERATIONAL) {
current_view=view(view_id(REG,783fd9e3,2) memb {
    783fd9e3,0
    968961af,0
    fde5638c,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
fifo_seq=222,
last_sent=0,
known:
783fd9e3 at tcp://10.1.1.122:4567
{o=1,s=0,i=1,fs=241,}
968961af at 
{o=1,s=0,i=1,fs=-1,}
fde5638c at tcp://10.1.1.121:4567
{o=1,s=0,i=1,fs=8,}
 }2017-07-05  0:44:38 139659862603520 [ERROR] WSREP: exception from gcomm, backend must be restarted: 968961af aborting due to conflicting prims: older overrides (FATAL)
     at gcomm/src/pc_proto.cpp:handle_state():982
2017-07-05  0:44:38 139659862603520 [Note] WSREP: gcomm: terminating thread
2017-07-05  0:44:38 139659862603520 [Note] WSREP: gcomm: joining thread
2017-07-05  0:44:38 139659862603520 [Note] WSREP: gcomm: closing backend
2017-07-05  0:44:38 139659862603520 [Note] WSREP: Forced PC close
2017-07-05  0:44:38 139659862603520 [Warning] WSREP: discarding 3 messages from message index
2017-07-05  0:44:38 139659862603520 [Note] WSREP: gcomm: closed
2017-07-05  0:44:38 139659854210816 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring

Best Answer

Starting a new cluster failed because not all nodes of the previous cluster had been shutdown before.
This can be learned from these lines from the log output in the question:

WSREP: 968961af conflicting prims: my prim: view_id(PRIM,968961af,1) other prim: view_id(PRIM,783fd9e3,1)
WSREP: exception from gcomm, backend must be restarted: 968961af aborting due to conflicting prims: older overrides (FATAL)

The conflicting node is the one at 783fd9e3 at 10.1.1.122 which can be learned from

783fd9e3 at tcp://10.1.1.122:4567

where the hex id is the one from behind "other prim".