MariaDB – Why has deadlock behaviour changed between MariaDB 10.1.22 and 10.2.14?

Tags: deadlock, mariadb, master-slave-replication, replication

Since upgrading from MariaDB 10.1.22 to 10.2.14, our MariaDB slaves have been encountering deadlocks that are not resolved within 600 seconds, which triggers InnoDB's long-semaphore-wait check and an intentional crash of the server. The server has crashed 3 times. Our extremely high volumes have not changed; the only difference is that MariaDB performance has improved with the upgrade.

Note that we run INSERT ... ON DUPLICATE KEY UPDATE statements at very high volume on our master. The deadlocks occur for the same queries only on the slaves, so it has to be related to locking under parallel replication on the slave. Reducing slave_parallel_workers has mitigated some of this.
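As a rough sketch of that mitigation, the statements below inspect and lower the number of parallel applier threads on a slave. This assumes the setting meant above is MariaDB's slave_parallel_threads (slave_parallel_workers is the MySQL name for the equivalent setting); the value 4 is purely illustrative, not a recommendation.

```sql
-- Run on the slave. Show the current parallel replication settings.
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';

-- slave_parallel_threads can only be changed while the slave threads are stopped.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;  -- illustrative value (assumption)
START SLAVE;
```

Fewer applier threads means fewer transactions are applied concurrently, which lowers the chance of lock conflicts at the cost of replication throughput.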

In summary, I am looking to understand what has changed in MariaDB 10.2.x regarding threads, timeouts, etc., in order to zoom in on this issue, and why MariaDB is unable to detect the deadlock and roll back one of the offending transactions.

I ACKNOWLEDGE that all deadlocks should be addressed but, as stated above, they are not occurring on the master, only on the slave for the same statements.

We had deadlocks prior to the upgrade, but MariaDB always handled them with NO problems.

2018-06-11 10:32:02 139519224362752 [Note] InnoDB: A semaphore wait:
--Thread 139518736328448 has waited at read0read.cc line 579 for 910.00 seconds the semaphore: Mutex at 0x7f2b63dc13a0, Mutex TRX_SYS created trx0sys.cc:554, lock var 2

2018-06-11 10:32:02 139519224362752 [Note] InnoDB: A semaphore wait:
--Thread 139518749968128 has waited at dict0dict.cc line 1160 for 910.00 seconds the semaphore: Mutex at 0x7f2b63dcb500, Mutex DICT_SYS created dict0dict.cc:1096, lock var 2

2018-06-11 10:32:02 139519224362752 [Note] InnoDB: A semaphore wait:
--Thread 139518750574336 has waited at dict0dict.cc line 1160 for 890.00 seconds the semaphore: Mutex at 0x7f2b63dcb500, Mutex DICT_SYS created dict0dict.cc:1096, lock var 2

InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending reads 2, writes 0
InnoDB: ###### Diagnostic info printed to the standard error stream
2018-06-11 10:32:32 139519224362752 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
180611 10:32:32 [ERROR] mysqld got signal 6 ;

Best Answer

Answer originally left in comments by the question author

I finally discovered that MariaDB addressed this bug in 10.2.13, but the problem still occurs in 10.2.14.

In summary, I have worked around the problem by setting INNODB_ADAPTIVE_HASH_INDEX = ON and have not had a problem since Monday morning. See https://jira.mariadb.org/browse/MDEV-14441 for the supposed 10.2.13 fix.
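A minimal sketch of applying that workaround at runtime, assuming an account with the privilege to set global variables. innodb_adaptive_hash_index is a dynamic variable, so no restart is needed.

```sql
-- Check the current setting.
SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';

-- Enable the adaptive hash index without restarting the server.
SET GLOBAL innodb_adaptive_hash_index = ON;

-- To keep the setting after a restart, also add this line under [mysqld]
-- in the server configuration file:
--   innodb_adaptive_hash_index = ON
```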

Evidently, when INNODB_ADAPTIVE_HASH_INDEX = OFF, MariaDB was skipping the release of the latches associated with the semaphores when deadlocks occurred.

The notes on the 10.2.13 fix explain why MariaDB was hanging during deadlocks. Quoting the developer's comments on the ticket:

The root cause of this bug is that in the function btr_cur_update_in_place() we are skipping this call if the adaptive hash index was disabled during the execution:

if (block->index) {
    btr_search_x_unlock(index);
}

When debugging the code, I mistook the leaked X lock for a leaked S lock. I did not find any other rw-lock leaks during the MDEV-14952 review/refactoring effort.

The lock leak was introduced by me in MDEV-12121.