Mysql – Mariadb-10.0.30 Replication drift

mariadbMySQLreplication

We've configured 4 mariadb 10.0.30 slaves, the master also has the same major version and about the same hardware (32 cores, 128Gbs ram, 4.4Tb of data on ssd raid), everything was running fine until we noticed some data was missing in the slaves (that's when we read all over data drift in mysql/mariadb replication), after reading thousands of articles and keep trying to find what happened we noticed that the data is missing (in the slaves) but the schemas are being updated, for example if I created a new db with some tables it will show up in the slaves but not the data that I could potentially insert, the same it's been happening to the rest of the databases/tables (and there are thousands), however we see that no errors are being displayed with SHOW SLAVE STATUS\G, but the replication is partially working, we were analyzing the files using mysqlbinlog in the slaves to get to the conclusion that data wasn't being sent to the slaves, as there is a lot of data it's a big effort to reconfigure/copy and start one slave again (also knowing that it could get into the same issue in a few weeks)

Here are some config parameters of interest:

Config:

[mariadb]

#InnoDB
innodb_buffer_pool_populate     = 1
innodb_buffer_pool_size         = 95G
innodb_file_per_table           = 1
innodb_corrupt_table_action     = warn
innodb_support_xa               = 1
innodb_open_files               = 100000
innodb_flush_neighbors          = 0
innodb_write_io_threads         = 64
innodb_log_file_size            = 1G
innodb_log_files_in_group       = 4
innodb_log_buffer_size          = 128M
innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_io_capacity              = 2000
innodb_buffer_pool_instances    = 8
innodb_stats_on_metadata        = 0

wait_timeout                    = 30

# General
datadir                         = /var/lib/mysql/data
tmpdir                          = /var/lib/mysql/tmp
socket                          = /var/lib/mysql/mysql.sock
skip-name-resolve
sql-mode                        = NO_ENGINE_SUBSTITUTION
open_files_limit                = 100000
max_connections                 = 1300

#Disable query cache
query-cache-size                = 0M
query-cache-limit               = 1M

# Recommended by Percona
table_open_cache                = 8192
thread_cache_size               = 256

## Replication
server-id                       = 12345
log-bin                         = /var/lib/mysqllogs/master-bin-log
relay-log                       = /var/lib/mysqllogs/master-relay-log
relay-log-space-limit           = 16G
slave_parallel_threads          = 8
read-only                       = 0
log-slave-updates               = 1
binlog-format                   = ROW
skip_slave_start

## Logging
log-output                      = FILE
slow-query-log                  = 1
slow-query-log-file             = /var/lib/mysql/slow-log
log-slow-slave-statements
long-query-time                 = 2

Slave config (only different values, skipping server-id, log-bin, relay-log)

read-only                       = 1
sync-binlog                     = 1
slave_skip_errors               = 1062

Server ids for each slave:

server-id                       = 5051
server-id                       = 36512
server-id                       = 71983
server-id                       = 92674

Master status (Note that the master status was added several hours after the slave status):

MariaDB [(none)]> show master status;
+--------------------+----------+--------------+------------------+
| File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| master-bin-log.021135 | 13151964 |              |                  |
+--------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

Slave status:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.10.10.1
                  Master_User: replusr
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin-log.021114
          Read_Master_Log_Pos: 711412247
               Relay_Log_File: slave1-relay-log.000044
                Relay_Log_Pos: 711412536
        Relay_Master_Log_File: master-bin-log.021114
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 711412247
              Relay_Log_Space: 711412876
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 95724
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-95724-49194074113
1 row in set (0.00 sec)

The data/inserts are in the master's binlog:

180703 16:16:22 server id 95724  end_log_pos 597993092         GTID 0-95724-49190481089
/*!100001 SET @@session.gtid_seq_no=49190481089*//*!*/;
BEGIN
/*!*/;
# at 597993092
# at 597993139
#180703 16:16:22 server id 95724  end_log_pos 597993139         Table_map: `test`.`dummy` mapped to number 4871549419
#180703 16:16:22 server id 95724  end_log_pos 597993238         Write_rows: table id 4871549419 flags: STMT_END_F
### INSERT INTO `test`.`dummy`
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='a' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
###   @2='b' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='c' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='d' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=5 /* INT meta=0 nullable=0 is_null=0 */
###   @2='e' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=6 /* INT meta=0 nullable=0 is_null=0 */
###   @2='f' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=7 /* INT meta=0 nullable=0 is_null=0 */
###   @2='g' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=8 /* INT meta=0 nullable=0 is_null=0 */
###   @2='h' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=9 /* INT meta=0 nullable=0 is_null=0 */
###   @2='i' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=10 /* INT meta=0 nullable=0 is_null=0 */
###   @2='j' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# at 597993238
#180703 16:16:22 server id 95724  end_log_pos 597993265         Xid = 60516562879
COMMIT/*!*/;

Also the data is in the relay log of the slaves:

# at 711212911
#180705 10:48:55 server id 95724  end_log_pos 711212660         GTID 0-95724-49248341051
/*!100001 SET @@session.gtid_seq_no=49248341051*//*!*/;
BEGIN
/*!*/;
# at 711212949
# at 711212996
#180705 10:48:55 server id 95724  end_log_pos 711212707         Table_map: `test`.`dummy` mapped to number 4982198060
#180705 10:48:55 server id 95724  end_log_pos 711212743         Write_rows: table id 4982198060 flags: STMT_END_F
### INSERT INTO `test`.`dummy`
### SET
###   @1=13
###   @2='m'
# at 711213032
#180705 10:48:55 server id 95724  end_log_pos 711212770         Xid = 62156308933
COMMIT/*!*/;

We also investigated percona tools like pt-table-checksum and pt-table-sync but they are useless if the data is not getting inserted in the binlog, it could be a bug?

The config for the slaves is similar to what's displayed here, so my questions are:

Why mysql keeps replicating schemas but not data?
Is there any way to fix the replication without having to start over again?
Any other idea?


Update:

Last night we stopped the master server, and all 4 slaves, after starting the master again we started 2 slaves which throwed this error:

           Last_Error: Could not execute Update_rows_v1 event on table example_db.orders; Can't find record in 'orders', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin-log.021343, end_log_pos 10929

After that, as a test I added slave_skip_errors = 1032 in the other 2 slaves and started the slaves, these 2 slaves started to sync correctly and also the data started to get inserted in the tables (only new data), of course this is a totally corrupted slave at this point, but it took a master restart and a slave_skip_errors = 1032 param to recover from the no data issue thing, the problem here is the silent replication break, if we could at least tell when it breaks without having to implement any custom script or checking on a daily basis. And also as we use binlog-format = ROW we cannot make use of pt-table-checksum and pt-table-sync to resync in such case.

Update2:

MariaDB [(none)]> SELECT @@innodb_lru_scan_depth, @@query_cache_type ;
+-------------------------+--------------------+
| @@innodb_lru_scan_depth | @@query_cache_type |
+-------------------------+--------------------+
|                    1024 | ON                 |
+-------------------------+--------------------+
1 row in set (0.00 sec)

Thanks.

Best Answer

Suggestions to consider for your my.cnf-ini [mysqld] section

query_cache_limit=0K  # from 1M to free RAM, QC not used
query_cache_type==0  # to ensure it is OFF and save CPU cycles
innodb_lru_scan_depth=128  # from 1024 to conserve CPU cycles every second

the best to you and your team.