Mysql – MariaDB Parallel Replication Drifting at random times

galeralinuxmariadbMySQLreplication

I have recently been thrown in the deep-end to manage our Database servers since our DBA left.

The current set-up is Mariadb (mariadb Ver 15.1 Distrib 10.1.38-MariaDB, for debian-linux-gnu), InnoDB (I turned on innodb_file_per_table as we had some 200+GB ibdata1 files.). We use parallel (conservative) replication.

a little bit of the config (just without ssl certs)

innodb_file_per_table = On
innodb_thread_concurrency = 0
innodb_buffer_pool_size = 40G
innodb_buffer_pool_instances = 20
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
table_open_cache = 8192
thread_cache_size = 256
table_cache = 70000
open_files_limit = 100000
log_slave_updates
collation-server = utf8_general_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
## Logging
log-output = FILE
slow-query-log = 1
slow-query-log-file = /var/lib/mysql/slow-log
log-slow-slave-statements
long-query-time = 30
log_warnings = 2

I have noticed that sometimes our servers drift but this usually corrects itself as I am assuming a large statement(s), one of the reasons I turned on Logging. But some servers will just not correct themselves and one of these only has a 2.6G ibdata1 file, I have restarted slave a few times but its just very slow. I am good on the Linux and Hardware side and ensured there is no other problems with disks. The network link between the servers via web, each server has 1Gbps Connection though and we see no latency ~6ms ping and stable. Some servers drift when there is no load on the servers, like the slave is not requesting from the master and it certainly doesn't seem to use full bandwidth. Overnight one of our servers has gone from being synced to 3 hours behind and seems to be increasing. I cannot see any issue on other database or even any users logged in to our application. could something be locking it from replicating?

Any help or suggestions would be really appreciated as I am kind of thrown in the deep-end here, I am not a DBA but whilst we do not currently have one I am the "best fit" and we need to get these servers replicated correctly for production.

If anyone also has any suggestions on learning resources, I think maybe this setup needs to be started from scratch, we do not use GTID for example, I wonder if we even use the correct replication method.

Many Thanks.

Best Answer

innodb_file_per_table = On  -- only applies during CREATE TABLE or ALTER TABLE
table_open_cache = 8192
table_cache = 70000  -- old name for table_open_cache; 70K is too high
init-connect='SET NAMES utf8'  -- keep in mind that user=root skips init_connect
long-query-time = 30  -- so high as to be virtually useless

"one of these only has a 2.6G ibdata1 file" -- What do you mean?

"But some servers will just not correct" -- Do you mean 'some Slaves'? Or 'some Clients'?

"no latency ~6ms ping and stable" -- that implies several hundred miles or km.

"Some servers drift" -- clocks drift; servers don't. What do you mean?

"3 hours behind" -- Do SHOW SLAVE STATUS;

Replication existed for more than a decade before GTID was added. That won't be to blame for what you are seeing.