MySQL Replication Slave Status OK but Data inconsistent

master-slave-replicationMySQLmysql-5.7replication

We are using MySQL 5.7, and have a Master and 3 Slave servers replication setup. All the 4 servers are dedicated for MySQL use only, and belong to the same data-center, connected over Private Network. So I believe that the network issues for replication should be minimal.

So, I did SHOW SLAVE STATUS\G on the Slave 1 server:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: <removed for anonymity>
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.004803
          Read_Master_Log_Pos: 885020002
               Relay_Log_File: ubuntu-s-4vcpu-8gb-blr1-slave02-relay-bin.001056
                Relay_Log_Pos: 885020215
        Relay_Master_Log_File: mysql-bin.004803
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 885020002
              Relay_Log_Space: 885020495
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 103
                  Master_UUID: 78e93502-d3af-11e7-9af0-5aaa686987ef
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

It clearly states that everything is fine. We also have quite a bit of monitoring tools like Zabbix, Packetbeat (Kibana) setup; which also gives us alerts if there is lag in the replication in any of the three Slave servers. We did not get any alerts about possible replication issues as well.

Now, I ran a query on Slave 1 server:

mysql> select order_id, order_status_id, buyer_invoice_id 
       from oc_suborder where order_id = 62284;
+----------+-----------------+------------------+
| order_id | order_status_id | buyer_invoice_id |
+----------+-----------------+------------------+
|    62284 |               1 |             NULL |
|    62284 |              15 |            76729 |
+----------+-----------------+------------------+
2 rows in set (0.01 sec)

It returns two rows in this case, which is wrong.

I ran the same query on the Master Server:

mysql> select order_id, order_status_id, buyer_invoice_id 
       from oc_suborder where order_id = 62284;
+----------+-----------------+------------------+
| order_id | order_status_id | buyer_invoice_id |
+----------+-----------------+------------------+
|    62284 |              15 |            76729 |
+----------+-----------------+------------------+
1 row in set (0.18 sec)

It returns one row only, which is correct; so the Slave is definitely having inconsistent data despite showing Slave Status as OK.

I also performed some DML operations on Master, and those changes did get replicated on the Slave server; so the replication is running. We are using third party DBA services, and every-time this happens; all they would suggest is to rebuild the Slave server(s). We are already sitting at 100 GB+ data, and it takes quite a few hours for this process (they have already done it 4 times in last few months); but the data inconsistency keeps on popping up time and again.

My questions are manifold:

  1. Is there any way to simply identify the data mismatch related tables/rows only, across multiple Schemas (Databases); and fix these specific rows instead of rebuilding the Slave again. My doubt is that I may have multiple tables with this data mismatch. We discover these mismatches only when some specific report contains suspicious data.

  2. What could be the possible reasons of the data mismatch appearing; even when Slave is 0 seconds behind Master. The example table in the queries above is using InnoDB. There are a few Transactions running on that table, and I am starting to suspect that a Transaction gets successfully committed on the Master, but not on the Slave sometimes.

  3. Using SHOW GLOBAL VARIABLES LIKE 'binlog_format';, I found that binlog_format on my Master server is set as MIXED, while on the Slave server it is set as ROW. Can this be source of this error or some other undiscovered errors ?

Please bear with me if the question seems Too Broad (as per Stack Overflow rules). I will be happy to provide as many details as needed, or Edit the question down to specific details. I am regular Stack Overflow user, and posting first time here. Any pointers would be really helpful.

Edit #1:

Are all the slaves configured with read-only=1? Is there any chance
that your application could be writing directly to the slaves?

Yes. Also, in our application code, we do have an elaborate code for Read-Write separation. Infact, all the queries inside a Transaction (START TRANSACTION .. COMMIT/ROLLBACK) are always routed to the Master Server, even if there are SELECT queries within this block. Even, to cover for replication lag (sometimes), we route the next 2 SELECT queries after a DML operation to Master only.

Are any of the slaves configured with slave-skip-errors by any chance?
If so, what is the value of this parameter?

Please find below:

mysql> select @@slave_skip_errors;
+---------------------+
| @@slave_skip_errors |
+---------------------+
| 1032,1062           |
+---------------------+
1 row in set (0.10 sec)

Edit #2

Please provide SHOW CREATE TABLE oc_suborder for both the Master and
the Slave with the problem you detailed.

Master Server:

mysql> show create table oc_suborder\G;
*************************** 1. row ***************************
       Table: oc_suborder
Create Table: CREATE TABLE `oc_suborder` (
  `order_id` int(11) NOT NULL,
  `suborder_id` varchar(20) NOT NULL,
  `gst` tinyint(1) NOT NULL DEFAULT '1',
  `buyer_invoice_id` int(11) unsigned DEFAULT NULL,
  `invoice_no` int(11) NOT NULL DEFAULT '0',
  `invoice_date` datetime DEFAULT NULL,
  `invoice_prefix` varchar(26) NOT NULL,
  `shipping_method` varchar(128) NOT NULL,
  `shipping_code` varchar(128) NOT NULL,
  `total` decimal(15,4) DEFAULT NULL,
  `order_status_id` int(11) NOT NULL,
  `date_added` datetime NOT NULL,
  `date_modified` datetime NOT NULL,
  `cform_submit` enum('no_submit','will_submit','submitted','') NOT NULL DEFAULT 'no_submit',
  `cst_with_cform` decimal(15,4) DEFAULT NULL,
  `refundable_cform` decimal(15,4) DEFAULT NULL,
  `refund_status` enum('refunded','not_refunded','not_applicable','') NOT NULL DEFAULT 'not_applicable',
  `courier_partner` varchar(100) DEFAULT NULL,
  `tracking_no` varchar(64) DEFAULT NULL,
  `shipping_charge` decimal(15,2) DEFAULT NULL,
  `custom_totals` text,
  `no_wsb_tape` tinyint(1) NOT NULL DEFAULT '0',
  `no_invoice_with_shipment` tinyint(1) NOT NULL DEFAULT '0',
  `courier_partner_preference` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`suborder_id`),
  UNIQUE KEY `order_id` (`order_id`,`suborder_id`),
  UNIQUE KEY `buyer_invoice_id` (`buyer_invoice_id`),
  KEY `invoice_date` (`invoice_date`),
  KEY `shipping_method` (`shipping_method`),
  KEY `order_status_id` (`order_status_id`),
  KEY `date_added` (`date_added`),
  KEY `tracking_no` (`tracking_no`),
  CONSTRAINT `oc_suborder_order_id_fk` FOREIGN KEY (`order_id`) REFERENCES `oc_order` (`order_id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.08 sec)

Slave server:

mysql> show create table oc_suborder\G;
*************************** 1. row ***************************
       Table: oc_suborder
Create Table: CREATE TABLE `oc_suborder` (
  `order_id` int(11) NOT NULL,
  `suborder_id` varchar(20) NOT NULL,
  `gst` tinyint(1) NOT NULL DEFAULT '1',
  `buyer_invoice_id` int(11) unsigned DEFAULT NULL,
  `invoice_no` int(11) NOT NULL DEFAULT '0',
  `invoice_date` datetime DEFAULT NULL,
  `invoice_prefix` varchar(26) NOT NULL,
  `shipping_method` varchar(128) NOT NULL,
  `shipping_code` varchar(128) NOT NULL,
  `total` decimal(15,4) DEFAULT NULL,
  `order_status_id` int(11) NOT NULL,
  `date_added` datetime NOT NULL,
  `date_modified` datetime NOT NULL,
  `cform_submit` enum('no_submit','will_submit','submitted','') NOT NULL DEFAULT 'no_submit',
  `cst_with_cform` decimal(15,4) DEFAULT NULL,
  `refundable_cform` decimal(15,4) DEFAULT NULL,
  `refund_status` enum('refunded','not_refunded','not_applicable','') NOT NULL DEFAULT 'not_applicable',
  `courier_partner` varchar(100) DEFAULT NULL,
  `tracking_no` varchar(64) DEFAULT NULL,
  `shipping_charge` decimal(15,2) DEFAULT NULL,
  `custom_totals` text,
  `no_wsb_tape` tinyint(1) NOT NULL DEFAULT '0',
  `no_invoice_with_shipment` tinyint(1) NOT NULL DEFAULT '0',
  `courier_partner_preference` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`suborder_id`),
  UNIQUE KEY `order_id` (`order_id`,`suborder_id`),
  UNIQUE KEY `buyer_invoice_id` (`buyer_invoice_id`),
  KEY `invoice_date` (`invoice_date`),
  KEY `shipping_method` (`shipping_method`),
  KEY `order_status_id` (`order_status_id`),
  KEY `date_added` (`date_added`),
  KEY `tracking_no` (`tracking_no`),
  CONSTRAINT `oc_suborder_order_id_fk` FOREIGN KEY (`order_id`) REFERENCES `oc_order` (`order_id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.07 sec)

Best Answer

  1. You can use pt-table-checksum and pt-table-sync to fix the data mismatch between the master and the slave. Here's a blog post for further reference: https://www.percona.com/blog/2015/08/12/mysql-replication-primer-with-pt-table-checksum-and-pt-table-sync/

  2. It is possible that an update went to the slave directly instead of the master. It can also be with the use of sql_log_bin=0 on the master. This can happen but you can prevent it by setting the slave with read_only=1 or super_read_only=1