Mysql – MariaDB Parallel Replication Drifting at random times

galera linux mariadb mysql replication

I have recently been thrown in at the deep end to manage our database servers since our DBA left.

The current setup is MariaDB (mariadb Ver 15.1 Distrib 10.1.38-MariaDB, for debian-linux-gnu) with InnoDB. I turned on innodb_file_per_table because we had ibdata1 files of 200+ GB. We use parallel (conservative) replication.

Part of the config (with the SSL certificate settings left out):

innodb_file_per_table = On
innodb_thread_concurrency = 0
innodb_buffer_pool_size = 40G
innodb_buffer_pool_instances = 20
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
table_open_cache = 8192
thread_cache_size = 256
table_cache = 70000
open_files_limit = 100000
log_slave_updates
collation-server = utf8_general_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
## Logging
log-output = FILE
slow-query-log = 1
slow-query-log-file = /var/lib/mysql/slow-log
log-slow-slave-statements
long-query-time = 30
log_warnings = 2

I have noticed that our servers sometimes drift behind the master. This usually corrects itself, and I assume it is caused by one or more large statements, which is one of the reasons I turned on logging. But some servers just do not catch up, and one of these only has a 2.6G ibdata1 file. I have restarted the slave a few times, but it is just very slow.

I am good on the Linux and hardware side and have made sure there are no problems with the disks. The network link between the servers runs over the internet; each server has a 1Gbps connection, and latency is low and stable (~6ms ping). Some servers drift when there is no load on them, as if the slave is not requesting data from the master, and it certainly doesn't seem to use the full bandwidth.

Overnight one of our servers has gone from being in sync to 3 hours behind, and the lag seems to be increasing. I cannot see any issue on the other databases, or even any users logged in to our application. Could something be locking it from replicating?

Any help or suggestions would be really appreciated, as I have been thrown in at the deep end here. I am not a DBA, but while we do not currently have one I am the "best fit", and we need to get these servers replicating correctly for production.

Suggestions for learning resources would also be welcome. I think this setup may need to be rebuilt from scratch. We do not use GTID, for example, and I wonder whether we are even using the correct replication method.

Many Thanks.

Best Answer

innodb_file_per_table = On  -- only applies during CREATE TABLE or ALTER TABLE
table_open_cache = 8192
table_cache = 70000  -- old name for table_open_cache; 70K is too high
init-connect='SET NAMES utf8'  -- keep in mind that user=root skips init_connect
long-query-time = 30  -- so high as to be virtually useless
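
Since long_query_time is a dynamic variable, the slow-log threshold can be lowered without restarting the server. A minimal sketch, assuming a 2-second threshold suits your workload (that value is only an illustration, not something taken from the config above):

```sql
-- Check the values the server is actually using.
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';

-- Lower the slow-log threshold at runtime (assumed value: 2 seconds).
-- Only connections opened after this change pick up the new global value.
SET GLOBAL long_query_time = 2;
```

Put the same value in my.cnf as well, otherwise it is lost at the next restart.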

"one of these only has a 2.6G ibdata1 file" -- What do you mean?

"But some servers will just not correct" -- Do you mean 'some Slaves'? Or 'some Clients'?

"no latency ~6ms ping and stable" -- that implies several hundred miles or km.

"Some servers drift" -- clocks drift; servers don't. What do you mean?

"3 hours behind" -- Do SHOW SLAVE STATUS;

Replication existed for more than a decade before GTID was added. That won't be to blame for what you are seeing.
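
To follow up on the SHOW SLAVE STATUS suggestion and the "could something be locking it" question, here is a rough set of checks to run on a lagging slave. This is a sketch rather than a full diagnostic procedure, and it assumes you are connected with the mysql command-line client (needed for \G):

```sql
-- Replication coordinates, thread state and last error.
SHOW SLAVE STATUS\G

-- Parallel replication settings (slave_parallel_threads, slave_parallel_mode, ...).
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';

-- What the replication threads and any client sessions are doing right now.
SHOW FULL PROCESSLIST;

-- Oldest open InnoDB transactions, which may be holding locks.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;
```

In the SHOW SLAVE STATUS output, compare Read_Master_Log_Pos with Exec_Master_Log_Pos. If the read position keeps up with the master but the exec position trails, the SQL thread is the bottleneck. If the read position itself trails the master's SHOW MASTER STATUS, look at the IO thread and the network instead.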

Step 01

On ServerB, run the following commands

STOP SLAVE;
SET GLOBAL innodb_max_dirty_pages_pct = 0;
FLUSH TABLES;
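
Setting innodb_max_dirty_pages_pct to 0 asks InnoDB to flush modified pages from the buffer pool as aggressively as it can. A small sketch for watching that happen, if you want to confirm the flush before moving on (the count may hover near zero rather than reach it exactly):

```sql
-- Confirm the setting took effect.
SHOW GLOBAL VARIABLES LIKE 'innodb_max_dirty_pages_pct';

-- Number of buffer pool pages still waiting to be written to disk.
-- Re-run until the value stops falling.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
```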

Step 02

On ServerC, run SHOW SLAVE STATUS\G

Repeat running SHOW SLAVE STATUS\G until Seconds_Behind_Master is 0

Then, run SET GLOBAL innodb_max_dirty_pages_pct = 0;

Step 03

On ServerB, run SHOW SLAVE STATUS\G

For the sake of example, let's say SHOW SLAVE STATUS\G looks like this:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.64.68.253
                Master_User: replusername
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.003202
        Read_Master_Log_Pos: 577991837
             Relay_Log_File: relay-bin.010449
              Relay_Log_Pos: 306229695
      Relay_Master_Log_File: mysql-bin.003202
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 577991837
            Relay_Log_Space: 306229695
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 0

Please note the following:

Master_Host (10.64.68.253)
Master_User (replusername)
Relay_Master_Log_File (mysql-bin.003202)
Exec_Master_Log_Pos (577991837)

Step 04

On ServerC, run the following:

STOP SLAVE;
SET GLOBAL innodb_max_dirty_pages_pct = 0;
FLUSH TABLES;
CHANGE MASTER TO
MASTER_HOST='10.64.68.253',
MASTER_PORT=3306,
MASTER_USER='replusername',
MASTER_PASSWORD='replpassword',
MASTER_LOG_FILE='mysql-bin.003202',
MASTER_LOG_POS=577991837;
START SLAVE;

Step 05

On ServerC, run SHOW SLAVE STATUS\G

If Seconds_Behind_Master is a number (not NULL), replication is running. CONGRATULATIONS !!!

Step 06

On ServerB, run SET GLOBAL innodb_max_dirty_pages_pct = 90;

On ServerC, run SET GLOBAL innodb_max_dirty_pages_pct = 90;

Give it a Try !!!

CAVEAT

If the majority of your data is MyISAM, ignore all commands that change innodb_max_dirty_pages_pct.
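
If you are not sure which storage engine holds most of your data, a query along these lines gives a rough breakdown. It is only a sketch: the sizes come from information_schema estimates, and the list of excluded system schemas is an assumption you may need to adjust.

```sql
-- Table count and approximate size per storage engine.
SELECT engine,
       COUNT(*) AS table_count,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
  AND engine IS NOT NULL
GROUP BY engine
ORDER BY size_gb DESC;
```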

Mysql – slow writes on percona / mariadb server only when replication is running

You noted on the Percona discussion group that you found the culprit to be the thread pool. Laurynas Biveinis then replied (Jul 16):

Ah. Your my.cnf has thread_pool_size = 2 which seems too low and possibly explains the poor performance of thread pool. If you want to use it, make sure to tune its settings, http://www.percona.com/doc/percona-server/5.6/performance/threadpool.html
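
To see whether the thread pool is active and how it is sized, something like the following should work. It is a sketch that assumes a Percona Server or MariaDB build with the thread pool compiled in; on other builds the second statement may simply return no rows.

```sql
-- 'pool-of-threads' means the thread pool is in use;
-- 'one-thread-per-connection' means it is not.
SHOW GLOBAL VARIABLES LIKE 'thread_handling';

-- Current thread pool settings, including thread_pool_size.
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';
```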