Mysql – Race condition in thesql / percona innodb binlog for master -> slave

binlogMySQLperconareplication

I'm having a really hard time diagnosing and working around a race condition that is causing problems on my slave and knocking it offline due to a foreign key relationship. We're using MIXED format replication in a basic master -> slave relationship.

The situation is thus, we have two tables in question, a topics table and a stats table. The topics table is generated from a maintenance task that adds new topics as they're found if they don't already exist. The second table stats keeps stats on when the topic trended, and how often it trended, basically keeping hourly statistics on each topic.

The race condition is taking place when a new topic is added. The stats table has a foreign key relating to the topic. When a new topic is added, once the INSERT has been confirmed as successful via a boolean response, the stats class is invoked which registers the first 'trended' count.

On the master this is working as expected, the topic is added, then the first stat is registered. However on the slave, this is not the case, the replication is logging the error…

Error 'Cannot add or update a child row: a foreign key constraint
fails (`db`.`stats`, CONSTRAINT `stats_primary` FOREIGN KEY (`Topic_ID`)
REFERENCES `topics` (`Topic_ID`) ON DELETE CASCADE ON UPDATE CASCADE)' 
on query.

… which basically says that the topics table doesn't have the topic that the stats table is trying to refer to.

I don't understand why there is a race-condition at all, the INSERT waits for a confirmation before attempting to create/increment a stat record, why is the binlog not keeping the statements in order? I've tried adding sanity-checks all over the place, and there are no problems on the master, any suggestions?


EDIT 1

Here is the schema for the two tables…

-- ----------------------------
-- Table structure for `history`
-- ----------------------------
DROP TABLE IF EXISTS `history`;
CREATE TABLE `history` (
  `ai_col` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `Topic_ID` char(32) NOT NULL,
  `timestamp` int(10) unsigned NOT NULL,
  `trended` int(10) unsigned NOT NULL DEFAULT '0',
  `viewed` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`ai_col`),
  UNIQUE KEY `history_primary` (`Topic_ID`,`timestamp`) USING BTREE,
  KEY `history_timestamp` (`timestamp`) USING BTREE,
  KEY `history_trended` (`trended`) USING BTREE,
  KEY `history_viewed` (`viewed`) USING BTREE,
  CONSTRAINT `history_primary` FOREIGN KEY (`Topic_ID`) REFERENCES `topics` (`Topic_ID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;

-- ----------------------------
-- Table structure for `topics`
-- ----------------------------
DROP TABLE IF EXISTS `topics`;
CREATE TABLE `topics` (
  `ai_col` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `Topic_ID` char(32) NOT NULL,
  `terms` varchar(255) NOT NULL,
  `title` varchar(100) DEFAULT NULL,
  `link` varchar(255) DEFAULT NULL,
  `type` enum('event','profile','timeline','general') DEFAULT NULL,
  `event_date` date DEFAULT NULL,
  `description` text,
  `active` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `created` datetime NOT NULL,
  `lastedit` datetime NOT NULL,
  PRIMARY KEY (`ai_col`),
  UNIQUE KEY `topics_topic_id` (`Topic_ID`) USING BTREE,
  KEY `topics_link` (`link`) USING BTREE,
  KEY `topics_title` (`title`) USING BTREE,
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Here are a few sample rows, including the one in particular that caused this specific error, however keep in mind, this error is happening as soon as the log is completely caught up (0s behind) and is taking place with NEW data. Also keep in mind, these are from the dump file so they're all inserts rather than updates, note the higher 'trended' values)…

INSERT INTO `topics` VALUES ('479410', '827658485a914809b6759d28e6fd7abe', 'd40bad723b2ccc7706bd612760ccd290', '07b2f3037d844db587ebbc5df4685820', 'PC Organics', 'Trending on Twitter,February 2017', 'PC Organics', 'Pc-Organics', 'general', null, 'Stories and events relating to PC Organics', '1', '2017-02-04 11:34:47', '2017-02-04 11:34:47');
INSERT INTO `topics` VALUES ('479411', 'd1fdb23fdf4949ae86441c8ae21573e0', 'd40bad723b2ccc7706bd612760ccd290', '07b2f3037d844db587ebbc5df4685820', 'Tom Lawrence', 'Trending on Twitter,February 2017', 'Tom Lawrence', 'Tom-Lawrence', 'general', null, 'Stories and events relating to Tom Lawrence', '1', '2017-02-04 11:34:47', '2017-02-04 11:34:47');
INSERT INTO `topics` VALUES ('479412', '74180ea91a2542eebcbe596b90311360', 'd40bad723b2ccc7706bd612760ccd290', '07b2f3037d844db587ebbc5df4685820', 'Moore', 'Trending on Twitter,February 2017', 'Moore', 'Moore', 'general', null, 'Stories and events relating to Moore', '1', '2017-02-04 11:34:47', '2017-02-04 11:34:47');

INSERT INTO `history` VALUES ('3647775', '827658485a914809b6759d28e6fd7abe', '1486224000', '4', '0');
INSERT INTO `history` VALUES ('3647776', 'd1fdb23fdf4949ae86441c8ae21573e0', '1486224000', '3', '0');
INSERT INTO `history` VALUES ('3647778', '74180ea91a2542eebcbe596b90311360', '1486224000', '3', '0');

And this is the my.cnf ansible template file with common, master, and slave components indicated …

[client]
socket=/mnt/data/mysql.sock

[mysqld]
datadir=/mnt/data/
socket=/mnt/data/mysql.sock
user=mysql
symbolic-links=0
port=3306
bind-address=0.0.0.0
server-id = {{ mysql_db_id }}
skip-name-resolve
character-set-server=utf8
collation-server=utf8_general_ci
explicit_defaults_for_timestamp = 1
ignore-db-dir=lost+found

# Logs
slow_query_log = OFF
#slow_query_log_file=/mnt/data/mysql-slow.log
#long_query_time= 3
log-error=/mnt/data/mysqld.log

# Limits
max_allowed_packet = 256M
tmp-table-size  = 32M
max-heap-table-size = 32M
query-cache-type  = 0
query-cache-size  = 0
max-connections = 1000
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache  = 4096
table-open-cache = 4096

{% if mysql_repl_role == 'master' %}
# Master
log_bin = mysql-bin
log-bin-index = mysql-bin.index
expire_logs_days = 10
max_binlog_size = 100M
max_binlog_files = 50
binlog_do_db = {{ database.name }}
binlog_format = MIXED
{% endif %}

{% if mysql_repl_role == 'slave' %}
# Slave
read_only
relay-log = relay-bin
relay-log-index = relay-bin.index
replicate-ignore-table = "db.sessions"
replicate-ignore-table = "db.cache"
{% endif %}

# INNODB
innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 256M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table          = 1
innodb-buffer-pool-size        = {{ innodb_buffer_pool_size }}

# HANDLERSOCKET PLUGIN
loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 8
loose_handlersocket_threads_wr = 1

[mysqldump]
quick
quote-names
max_allowed_packet = 128M

[mysqld_safe]
log-error=/mnt/data/mysqld.log
pid-file=/mnt/data/mysqld.pid

Everything starts fine, and the rest of the system is able to exist happily with the replication in place. The ONLY issue is with the two tables outlined above, and we're talking about 200+ tables with most tables having multiple FK relationships. I've gone so far as ensuring I stopped using any non-ANSI compatible / MySQL specific query functions such as ... ON DUPLICATE KEY UPDATE ... which I know has replication issues, but haven't hit on anything yet.


EDIT 2: binlog

Here is the statement in question causing the Foreign Key error above…

# at 1403605
#170204 12:05:01 server id 1  end_log_pos 1404039 CRC32 0x8dd8f69e      Query   thread_id=1120  exec_time=0     error_code=0
SET TIMESTAMP=1486227901/*!*/;
BEGIN
/*!*/;
# at 1403692
# at 1403724
#170204 12:05:01 server id 1  end_log_pos 1404071 CRC32 0x116b9b3a      Intvar
SET INSERT_ID=3647784/*!*/;
#170204 12:05:01 server id 1  end_log_pos 1404270 CRC32 0x45ec23e8      Query   thread_id=1120  exec_time=0     error_code=0
SET TIMESTAMP=1486227901/*!*/;
INSERT INTO history SET Topic_ID = '827658485a914809b6759d28e6fd7abe', `timestamp` = 1486227600, trended = 1
/*!*/;
# at 1403923
#170204 12:05:01 server id 1  end_log_pos 1404301 CRC32 0xe396df1e      Xid = 157330
COMMIT/*!*/;

The SHOW SLAVE STATUS \G is showing the following (which points to the above)…

Relay_Log_File: relay-bin.000003
Relay_Log_Pos: 1403605

And just to recap, here is the full error listed under Last_SQL_Error

 Last_SQL_Error: Error 'Cannot add or update a child row: a foreign key constraint fails (`db`.`history`, CONSTRAINT `history_primary` FOREIGN KEY (`Topic_ID`) REFERENCES `topics` (`Topic_ID`) ON DELETE CASCADE ON UPDATE CASCADE)' on query. Default database: 'db'. Query: 'INSERT INTO history SET Topic_ID = '827658485a914809b6759d28e6fd7abe', `timestamp` = 1486227600, trended = 1'

Best Answer

So in the end, @jynus suggested in the OP comments that in their experience, replication issues like this often came down to the filters and unexpected behaviours resulting from them which seems to have led to a solution...

Because we needed the filters for a number of reasons, instead of eliminating them, I tried to be verbose with our declarations which led to explicitly declaring what DB to replicate using replicate-do-db in the my.cnf slave definition, just before the replicate-ignore-table declarations.

While I don't find it very logical since all db's would have been replicated by not delcaring a specific one, it seems that despite logic, this fixed our issue. The slave is at 5hrs uptime with no issue.

I'll add any relevant info if I come across any.