Percona MySQL – Unable to Start MySQL After Restart

galeraMySQLperconaxtradb-cluster

I just configured a percona xtradb cluster in ubuntu with 2 node.

This is what I did.

  1. launched 2 new servers.
  2. Installed percona cluster on both nodes.
  3. Stopped mysql on both nodes.

On server1:

/etc/mysql/percona-xtradb-cluster.conf.d/wsrep.cnf
wsrep_cluster_address=gcomm://172.31.5.172,172.31.13.81
binlog_format=ROW
wsrep_slave_threads= 8
wsrep_log_conflicts
innodb_autoinc_lock_mode=2
wsrep_node_address=172.31.5.172
wsrep_cluster_name=pxc-cluster
pxc_strict_mode=DISABLED
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="sstuser:s3cretPass"

On Server 2:

/etc/mysql/percona-xtradb-cluster.conf.d/wsrep.cnf
wsrep_cluster_address=gcomm://172.31.5.172,172.31.13.81
binlog_format=ROW
wsrep_slave_threads= 8
wsrep_log_conflicts
innodb_autoinc_lock_mode=2
wsrep_node_address=172.31.13.81
wsrep_cluster_name=pxc-cluster
pxc_strict_mode=DISABLED
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="sstuser:s3cretPass"

On server1:
/etc/init.d/mysql bootstrap-pxc
Then mysql started.

CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass';
GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;

On Server 2:

service mysql start

As of now no issues, tested the replicatin. Then I did a restart both nodes.

After that mysql won't start.

Server1:

2017-11-05T17:22:03.505946Z mysqld_safe Logging to '/var/log/mysqld.log'.
2017-11-05T17:22:03.515607Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2017-11-05T17:22:03.520623Z mysqld_safe Skipping wsrep-recover for cc6a7fb6-c24b-11e7-bd13-ba1518e34f61:0 pair
2017-11-05T17:22:03.521158Z mysqld_safe Assigning cc6a7fb6-c24b-11e7-bd13-ba1518e34f61:0 to wsrep_start_position
2017-11-05T17:22:03.706708Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2017-11-05T17:22:03.708134Z 0 [Note] /usr/sbin/mysqld (mysqld 5.7.19-17-57-log) starting as process 11089 ...
2017-11-05T17:22:03.709661Z 0 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=ip-172-31-5-172-bin' to avoid this problem.
2017-11-05T17:22:03.709964Z 0 [Note] WSREP: Setting wsrep_ready to false
2017-11-05T17:22:03.709977Z 0 [Note] WSREP: No pre-stored wsrep-start position found. Skipping position initialization.
2017-11-05T17:22:03.709982Z 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera3/libgalera_smm.so'
2017-11-05T17:22:03.712249Z 0 [Note] WSREP: wsrep_load(): Galera 3.22(r8678538) by Codership Oy <info@codership.com> loaded successfully.
2017-11-05T17:22:03.712290Z 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-11-05T17:22:03.712549Z 0 [Note] WSREP: Found saved state: cc6a7fb6-c24b-11e7-bd13-ba1518e34f61:0, safe_to_bootsrap: 1
2017-11-05T17:22:03.713823Z 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.31.5.172; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 7; socket.checksum = 2; socket.recv_buf_size = 212992;
2017-11-05T17:22:03.727615Z 0 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
2017-11-05T17:22:03.727630Z 0 [Note] WSREP: Preparing to initiate SST/IST
2017-11-05T17:22:03.727634Z 0 [Note] WSREP: Starting replication
2017-11-05T17:22:03.727640Z 0 [Note] WSREP: Setting initial position to cc6a7fb6-c24b-11e7-bd13-ba1518e34f61:0
2017-11-05T17:22:03.727741Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2017-11-05T17:22:03.727792Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2017-11-05T17:22:03.727852Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2017-11-05T17:22:03.727859Z 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2017-11-05T17:22:03.728266Z 0 [Note] WSREP: GMCast version 0
2017-11-05T17:22:03.728398Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-11-05T17:22:03.728405Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-11-05T17:22:03.728686Z 0 [Note] WSREP: EVS version 0
2017-11-05T17:22:03.728771Z 0 [Note] WSREP: gcomm: connecting to group 'pxc-cluster', peer '172.31.5.172:,172.31.13.81:'
2017-11-05T17:22:03.728686Z 0 [Note] WSREP: EVS version 0
2017-11-05T17:22:03.728771Z 0 [Note] WSREP: gcomm: connecting to group 'pxc-cluster', peer '172.31.5.172:,172.31.13.81:'
2017-11-05T17:22:03.729595Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection established to d73a400b tcp://172.31.5.172:4567
2017-11-05T17:22:03.729609Z 0 [Warning] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') address 'tcp://172.31.5.172:4567' points to own listening address, blacklisting
2017-11-05T17:22:06.729830Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.31.13.81:4567 timed out, no messages seen in PT3S
2017-11-05T17:22:06.729954Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection to peer d73a400b with addr tcp://172.31.5.172:4567 timed out, no messages seen in PT3S
2017-11-05T17:22:06.730072Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-11-05T17:22:06.730101Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,d73a400b,1)
memb {
        d73a400b,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2017-11-05T17:22:07.230228Z 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50154S), skipping check
2017-11-05T17:22:10.730540Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.31.13.81:4567 timed out, no messages seen in PT3S
2017-11-05T17:22:14.731257Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.31.13.81:4567 timed out, no messages seen in PT3S
2017-11-05T17:22:16.154486Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') connection established to de905d70 tcp://172.31.13.81:4567
2017-11-05T17:22:16.154851Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-11-05T17:22:16.654815Z 0 [Note] WSREP: declaring de905d70 at tcp://172.31.13.81:4567 stable
2017-11-05T17:22:16.655411Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-11-05T17:22:16.655466Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,d73a400b,2)
memb {
        d73a400b,0
        de905d70,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2017-11-05T17:22:19.232193Z 0 [Note] WSREP: (d73a400b, 'tcp://0.0.0.0:4567') turning message relay requesting off

Server2:

2017-11-05T17:22:15.979415Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2017-11-05T17:22:15.981265Z 0 [Note] mysqld (mysqld 5.7.19-17-57-log) starting as process 1325 ...
2017-11-05T17:22:15.983852Z 0 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=ip-172-31-13-81-bin' to avoid this problem.
2017-11-05T17:22:15.984031Z 0 [Note] WSREP: Setting wsrep_ready to false
2017-11-05T17:22:15.984043Z 0 [Note] WSREP: No pre-stored wsrep-start position found. Skipping position initialization.
2017-11-05T17:22:15.984047Z 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera3/libgalera_smm.so'
2017-11-05T17:22:16.013201Z 0 [Note] WSREP: wsrep_load(): Galera 3.22(r8678538) by Codership Oy <info@codership.com> loaded successfully.
2017-11-05T17:22:16.013252Z 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-11-05T17:22:16.015682Z 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootsrap: 0
2017-11-05T17:22:16.019709Z 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.31.13.81; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 7; socket.checksum = 2; socket.recv_buf_size = 212992;
2017-11-05T17:22:16.034431Z 0 [Note] WSREP: GCache history reset: cc6a7fb6-c24b-11e7-bd13-ba1518e34f61:0 -> 00000000-0000-0000-0000-000000000000:-1
2017-11-05T17:22:16.035026Z 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2017-11-05T17:22:16.035038Z 0 [Note] WSREP: Preparing to initiate SST/IST
2017-11-05T17:22:16.035041Z 0 [Note] WSREP: Starting replication
2017-11-05T17:22:16.035048Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-11-05T17:22:16.035159Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2017-11-05T17:22:16.035214Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2017-11-05T17:22:16.036228Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2017-11-05T17:22:16.036250Z 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2017-11-05T17:22:16.036685Z 0 [Note] WSREP: GMCast version 0
2017-11-05T17:22:16.036828Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-11-05T17:22:16.036836Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-11-05T17:22:16.037997Z 0 [Note] WSREP: EVS version 0
2017-11-05T17:22:16.038072Z 0 [Note] WSREP: gcomm: connecting to group 'pxc-cluster', peer '172.31.5.172:,172.31.13.81:'
2017-11-05T17:22:16.039496Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') connection established to de905d70 tcp://172.31.13.81:4567
2017-11-05T17:22:16.039513Z 0 [Warning] WSREP: (de905d70, 'tcp://0.0.0.0:4567') address 'tcp://172.31.13.81:4567' points to own listening address, blacklisting
2017-11-05T17:22:16.039960Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') connection established to d73a400b tcp://172.31.5.172:4567
2017-11-05T17:22:16.040014Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-11-05T17:22:16.540329Z 0 [Note] WSREP: declaring d73a400b at tcp://172.31.5.172:4567 stable
2017-11-05T17:22:16.540836Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-11-05T17:22:16.540862Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,d73a400b,2)
memb {
        d73a400b,0
        de905d70,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2017-11-05T17:22:19.039641Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') connection to peer de905d70 with addr tcp://172.31.13.81:4567 timed out, no messages seen in PT3S
2017-11-05T17:22:19.539775Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-05T17:22:37.061048Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.31.5.172:4567
2017-11-05T17:22:38.061306Z 0 [Note] WSREP: (de905d70, 'tcp://0.0.0.0:4567') reconnecting to d73a400b (tcp://172.31.5.172:4567), attempt 0
2017-11-05T17:22:41.562259Z 0 [Note] WSREP: evs::proto(de905d70, OPERATIONAL, view_id(REG,d73a400b,2)) suspecting node: d73a400b
2017-11-05T17:22:41.562325Z 0 [Note] WSREP: evs::proto(de905d70, OPERATIONAL, view_id(REG,d73a400b,2)) suspected node without join message, declaring inactive
2017-11-05T17:22:42.562445Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,d73a400b,2)
memb {
        de905d70,0
        }
joined {
        }
left {
        }
partitioned {
        d73a400b,0
        }
)
2017-11-05T17:22:42.562525Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-11-05T17:22:42.562550Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,de905d70,3)
memb {
        de905d70,0
        }
joined {
        }
left {
        }
partitioned {
        d73a400b,0
        }
)
2017-11-05T17:22:47.071058Z 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2017-11-05T17:22:47.071334Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
2017-11-05T17:22:47.071356Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-11-05T17:22:47.071412Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1513: Failed to open channel 'pxc-cluster' at 'gcomm://172.31.5.172,172.31.13.81': -110 (Connection timed out)
2017-11-05T17:22:47.071421Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-11-05T17:22:47.071426Z 0 [ERROR] WSREP: Provider/Node (gcomm://172.31.5.172,172.31.13.81) failed to establish connection with cluster (reason: 7)
2017-11-05T17:22:47.071430Z 0 [ERROR] Aborting

2017-11-05T17:22:47.071433Z 0 [Note] Giving 0 client threads a chance to die gracefully
2017-11-05T17:22:47.071437Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2017-11-05T17:22:47.071440Z 0 [Note] WSREP: Service disconnected.
2017-11-05T17:22:47.071443Z 0 [Note] WSREP: Waiting to close threads......
2017-11-05T17:22:52.072202Z 0 [Note] WSREP: Some threads may fail to exit.
2017-11-05T17:22:52.072285Z 0 [Note] Binlog end
2017-11-05T17:22:52.072905Z 0 [Note] mysqld: Shutdown complete

2017-11-05T17:22:52.322273Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2017-11-05T17:22:52.325253Z 0 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=ip-172-31-13-81-bin' to avoid this problem.
2017-11-05T17:22:52.344397Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2017-11-05T17:22:52.346176Z 0 [Warning] WSREP: (f4348c81, 'tcp://0.0.0.0:4567') address 'tcp://172.31.13.81:4567' points to own listening address, blacklisting
2017-11-05T17:22:52.847728Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-11-05T17:23:22.856010Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
2017-11-05T17:23:22.856051Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-11-05T17:23:22.856128Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1513: Failed to open channel 'pxc-cluster' at 'gcomm://172.31.5.172,172.31.13.81': -110 (Connection timed out)
2017-11-05T17:23:22.856140Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-11-05T17:23:22.856147Z 0 [ERROR] WSREP: Provider/Node (gcomm://172.31.5.172,172.31.13.81) failed to establish connection with cluster (reason: 7)
2017-11-05T17:23:22.856151Z 0 [ERROR] Aborting

Best Answer

Yes Rick is right. Because all of nodes are down you have to bootstrap Cluster again.

You stopped all nodes gracefully.

If you know What is the last one (node) stopped, choice this node too bootstrap and then start another node.

But If you don't remember the last node which you have turned off, open grastate.dat (locate in mysql directory) from all nodes and compare seqno number inside and choice a node that has bigger number and start this node first. If you choice wrong node for bootstrap, another node has to remove all data and request to SST again. so be careful