MySQL Group Replication stuck on RECOVERING


I'm testing MySQL Group Replication.

Environment:
Microsoft Azure
Ubuntu 16.04

I've followed this guide:
https://www.digitalocean.com/community/tutorials/how-to-configure-mysql-group-replication-on-ubuntu-16-04

However, I also added a second NIC to each of the two machines I'm testing on, so that replication runs on a separate LAN and the primary LAN is used only for application connections.

As far as I can tell, the nodes can see each other:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | f0bcfc98-4255-11e8-b39f-000d3a1db637 | 10.3.1.4    |        3306 | ONLINE       |
| group_replication_applier | f8867ce8-4255-11e8-b03c-000d3a133573 | 10.3.1.5    |        3306 | RECOVERING   |
+---------------------------+--------------------------------------+-------------+-------------+--------------+

10.3.1.0/24 is my replication LAN. 10.3.0.0/24 is the application/primary LAN.

The problem is that when I add the second node, it stays at RECOVERING. After a while, the group removes it.

2018-05-01T15:01:18.067031Z 0 [Note] Plugin group_replication reported: 'getstart group_id e459a5fc'
2018-05-01T15:01:19.067227Z 11 [Note] Plugin group_replication reported: 'Only one server alive. Declaring this server as online within the replication group'
2018-05-01T15:01:19.067238Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 10.3.1.4:3306 on view 15251868790666823:1.'
2018-05-01T15:01:19.074671Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'
2018-05-01T15:01:47.201473Z 0 [Note] Plugin group_replication reported: 'getstart group_id e459a5fc'
2018-05-01T15:01:52.684929Z 0 [Note] Plugin group_replication reported: 'Members joined the group: 10.3.1.5:3306'
2018-05-01T15:01:52.685016Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 10.3.1.4:3306, 10.3.1.5:3306 on view 15251868790666823:2.'

2018-05-01T15:10:53.575405Z 0 [Note] Plugin group_replication reported: 'getstart group_id e459a5fc'
2018-05-01T15:10:53.913923Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 10.3.1.5:3306'
2018-05-01T15:10:53.914018Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 10.3.1.4:3306 on view 15251868790666823:3.'

Here are the contents of /etc/mysql/mysql.conf.d/mysqld.cnf:

[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
log-error       = /var/log/mysql/error.log

gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = binlog
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = OFF
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1

# Shared replication group configuration
loose-group_replication_group_name = "fe2d1be8-1ba8-403b-83df-f75711b2f3d1"
loose-group_replication_ip_whitelist = "10.3.1.0/24"
loose-group_replication_group_seeds = "10.3.1.4:33061,10.3.1.5:33061"

# Single or Multi-primary mode? Uncomment these two lines
# for multi-primary mode, where any host can accept writes
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON

# Host specific replication configuration
server_id = 1
bind-address = "10.3.0.23"
report_host = "10.3.1.4"
loose-group_replication_local_address = "10.3.1.4:33061"



lc-messages-dir = /usr/share/mysql
skip-external-locking
max_connections = 1000
table_open_cache=2048
sql_mode = "STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"

# By default we only accept connections from localhost
# bind-address  = 127.0.0.1
# Disabling symbolic-links is recommended to prevent assorted security risks
# symbolic-links=0

# * Fine Tuning
#
key_buffer_size = 64M
max_allowed_packet      = 768M
thread_stack            = 256K
thread_cache_size       = 64
join_buffer_size = 16M
sort_buffer_size = 8M

long_query_time = 2
log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
#       other settings you may need to change.
#server-id              = 1
#log_bin                        = /var/log/mysql/mysql-bin.log
expire_logs_days        = 10
max_binlog_size   = 100M
#binlog_do_db           = include_database_name
#binlog_ignore_db       = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
max_heap_table_size=2G
tmp_table_size=2G


innodb_write_io_threads=8
innodb_read_io_threads=8

innodb_log_file_size = 1G
innodb_buffer_pool_size = 50G
innodb_log_buffer_size = 20M
#wait_timeout = 28800
#wait_timeout = 20
#interactive_timeout = 28800
interactive_timeout = 120
innodb_flush_log_at_trx_commit=0
innodb_flush_method = O_DIRECT
innodb_buffer_pool_instances = 8
innodb_sort_buffer_size = 2M

Please let me know if I've left out any important information, and I'll supply it ASAP. Note that the config above is for the primary; the config for the secondary is identical except for the IP addresses.
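Since the two configs are meant to differ only in host-specific values, a quick comparison can show whether any of those values was copied over unchanged. The following is a minimal sketch in Python, assuming both files parse as INI text. The function names and the PER_HOST_KEYS list are hypothetical, taken from the "Host specific replication configuration" block above. server_id is included because Group Replication expects it to be unique on every member.

```python
import configparser

# Keys expected to differ on every member (assumed list, based on the
# "Host specific replication configuration" block in the config above).
PER_HOST_KEYS = (
    "server_id",
    "bind-address",
    "report_host",
    "loose-group_replication_local_address",
)


def load_mysqld_section(text):
    """Parse my.cnf-style text and return the [mysqld] section as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # options such as skip-external-locking have no value
        strict=False,          # tolerate repeated options
        interpolation=None,    # treat '%' literally
    )
    parser.read_string(text)
    return dict(parser["mysqld"])


def keys_with_same_value(cfg_a, cfg_b, keys=PER_HOST_KEYS):
    """Return the per-host keys that carry the same value in both configs."""
    a = load_mysqld_section(cfg_a)
    b = load_mysqld_section(cfg_b)
    return [k for k in keys if k in a and a.get(k) == b.get(k)]
```

Calling keys_with_same_value() with the text of both mysqld.cnf files should return an empty list; any key it reports is a candidate for a copy-and-paste mistake.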

One question I have is: Shouldn't the MEMBER_PORT be 33061 instead of 3306 since we set that in the config as the replication port?

Thanks in advance!

Best Answer

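Please make sure that all the ports used by MySQL Group Replication (as per your config) are open between the nodes:

1) 3306 (the regular MySQL port, which the joining node also uses for distributed recovery)

2) 33061 (the group communication port set in group_replication_local_address)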
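One way to verify this is to test both ports from each node against the other node's replication address. The following is a minimal sketch in Python using only the standard library; port_open and check_targets are hypothetical helper names, and the addresses in the usage comment are the ones from the question. Note that the posted config sets bind-address to 10.3.0.23 while report_host advertises 10.3.1.4, so port 3306 may not be accepting connections on the replication LAN address at all; a check like this would show it.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_targets(hosts, ports, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}


# Example, run on each node (addresses and ports from the question's config):
# for target, ok in check_targets(["10.3.1.4", "10.3.1.5"], [3306, 33061]).items():
#     print(target, "open" if ok else "unreachable")
```

A port reported as unreachable can mean either a firewall rule (for example an Azure network security group) or that nothing is listening on that address.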

Also, if the member is being removed from the group, the MySQL error log should contain the reason. Please post the detailed MySQL error log.
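To pull the relevant entries out of a long error log, something like the following could help. It is a rough sketch in Python that assumes the log line layout shown in the question (timestamp, thread id, level in brackets, then "Plugin group_replication reported: '...'"); the function name group_replication_events is hypothetical.

```python
import re

# Matches lines in the format shown in the question's log excerpt.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\S+)\s+(?P<thread>\d+)\s+\[(?P<level>\w+)\]\s+"
    r"Plugin group_replication reported: '(?P<msg>.*)'\s*$"
)


def group_replication_events(lines, levels=("warning", "error")):
    """Return (timestamp, level, message) for plugin lines at the given levels."""
    wanted = {level.lower() for level in levels}
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level").lower() in wanted:
            events.append((match.group("ts"), match.group("level"), match.group("msg")))
    return events


# Example (log path taken from the log-error setting in the question's config):
# with open("/var/log/mysql/error.log") as log:
#     for ts, level, msg in group_replication_events(log):
#         print(ts, level, msg)
```

Running it on the joining node's log as well as the primary's is worthwhile, since the excerpt in the question only shows the primary's view of the removal.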

I have set up a Group Replication cluster in production with 3 nodes, and it is working fine.