MySQL – Percona XtraDB Cluster – Random deadlock

galera, mysql, percona, xtradb-cluster

At random intervals my database locks up completely: queries hang and pile up until the server starts rejecting clients with "Too many connections".

Below is an excerpt from my MySQL log; I couldn't find anything that explains the "Interrupted system call" errors.

Aug 28 14:06:36 db1a mysqld: 2015-08-28 14:06:36 11786 [Note] WSREP: Created page /var/data/mysql/gcache.page.000008 of size 134217728 bytes
Aug 28 14:07:06 db1a mysqld: 2015-08-28 14:07:06 11786 [Note] WSREP: Deleted page /var/data/mysql/gcache.page.000008
Aug 28 16:39:38 db1a mysqld: 2015-08-28 16:39:38 11786 [Note] WSREP: Created page /var/data/mysql/gcache.page.000009 of size 134217728 bytes
Aug 28 16:39:48 db1a mysqld: 2015-08-28 16:39:48 11786 [Note] WSREP: Deleted page /var/data/mysql/gcache.page.000009
Aug 28 19:42:07 db1a mysqld: 2015-08-28 19:42:07 11786 [Note] WSREP: Created page /var/data/mysql/gcache.page.000010 of size 134217728 bytes
Aug 28 19:42:08 db1a mysqld: 2015-08-28 19:42:08 11786 [Note] WSREP: Created page /var/data/mysql/gcache.page.000011 of size 134217728 bytes
Aug 28 19:42:10 db1a mysqld: 2015-08-28 19:42:10 11786 [Warning] WSREP: Failed to report last committed 758795619, -4 (Interrupted system call)
Aug 28 19:42:45 db1a mysqld: 2015-08-28 19:42:45 11786 [Warning] WSREP: Failed to report last committed 758795879, -4 (Interrupted system call)
Aug 28 19:43:07 db1a mysqld: 2015-08-28 19:43:07 11786 [Warning] WSREP: Failed to report last committed 758796011, -4 (Interrupted system call)
Aug 28 19:43:11 db1a mysqld: 2015-08-28 19:43:11 11786 [Warning] WSREP: Failed to report last committed 758796012, -4 (Interrupted system call)
Aug 28 19:43:49 db1a mysqld: 2015-08-28 19:43:49 11786 [Warning] Too many connections
Aug 28 19:43:49 db1a mysqld: 2015-08-28 19:43:49 11786 [Warning] Too many connections
Aug 28 19:43:50 db1a mysqld: 2015-08-28 19:43:50 11786 [Warning] Too many connections
Aug 28 19:43:51 db1a mysqld: 2015-08-28 19:43:51 11786 [Warning] Too many connections

MySQL starts throwing this error at my application:

SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction

Version: 5.6.24-72.2-56-log Percona XtraDB Cluster (GPL), Release rel72.2, Revision 43abf03, WSREP version 25.11, wsrep_25.11

my.cnf

# -- SERVER ---------------------------------------------- #

[mysqld_safe]

pid-file        = /var/data/run/mysqld.pid
socket          = /var/data/run/mysqld.sock
nice            = 0 
flush_caches    = 1 
numa_interleave = 1 
syslog

[mysqld]

user        = mysql
pid-file    = /var/data/run/mysqld.pid
socket      = /var/data/run/mysqld.sock
port        = 3306
basedir     = /usr
datadir     = /var/data/mysql
tmpdir      = /mnt/tmp

lc-messages-dir = /usr/share/mysql
skip-external-locking

default_time_zone = America/New_York

character-set-server    = utf8
collation-server        = utf8_general_ci

transaction-isolation   = READ-COMMITTED

# -- Cluster Settings -------------------------- #

# Path to Galera library
wsrep_provider = /usr/lib/libgalera_smm.so

# Cluster connection URL contains the IPs of node#1, node#2 and node#3
# It should be empty during bootstrap
wsrep_cluster_address=gcomm://10.0.200.8,10.0.210.7
#wsrep_cluster_address=gcomm://

# In order for Galera to work correctly binlog format should be ROW
binlog_format = ROW

# MyISAM storage engine has only experimental support
default_storage_engine = InnoDB

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode = 2

# We don't trust Galera's automatic auto_increment control when nodes are removed
# and added to the cluster; each node is given a fixed, distinct offset instead.
wsrep_auto_increment_control = OFF
auto_increment_increment = 3
auto_increment_offset    = 1

# Node #1 address
wsrep_node_address = 10.0.200.7

# Cluster and node name
wsrep_cluster_name = db1
wsrep_node_name = db1a

# SST method
wsrep_sst_method            = xtrabackup-v2
wsrep_sst_auth              = "db1:XXXXXXXXXXXX"
wsrep_sst_receive_address   = 10.0.200.7
wsrep_sst_donor             = db1b,db1c

wsrep_slave_threads = 4

# ---------------------------------------------- #

# 
# * Timeouts
#
connect_timeout          = 5
lock_wait_timeout        = 3600
interactive_timeout      = 1800
wait_timeout             = 3600

#
# * Buffer
#
key_buffer_size      = 32M
sort_buffer_size     = 4M
read_rnd_buffer_size = 4M
join_buffer_size     = 4M

max_allowed_packet   = 64M
thread_stack         = 512K
thread_cache_size    = 12

table_open_cache     = 4096
open_files_limit     = 65536
max_heap_table_size  = 1G
tmp_table_size       = 1G

myisam-recover       = BACKUP
max_connections      = 500

#
# * Query Cache Configuration
#
# Cache size needs to be set to 0 at startup with XtraDB Cluster
# It can be changed at runtime
# http://www.percona.com/doc/percona-xtradb-cluster/5.6/limitation.html

query_cache_type  = 1
query_cache_limit = 10M
query_cache_size  = 0

#
# * Logging and Replication
#

# It has to be logged to FILE to work with XtraDB Cluster
# http://www.percona.com/doc/percona-xtradb-cluster/5.6/limitation.html
log_output = FILE

#general_log_file        = /var/log/mysql/mysql.log
#general_log             = 1

#log_error = /var/log/mysql/error.log

slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 10
#log-queries-not-using-indexes

#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
#       other settings you may need to change.

server-id           = 1
log_bin             = /var/data/log/mysql-bin.log
expire_logs_days    = 10
max_binlog_size     = 512M
log_bin_trust_function_creators = 1
log-slave-updates

innodb_buffer_pool_size         = 48G
innodb_buffer_pool_instances    = 10
innodb_log_file_size            = 1G
innodb_log_buffer_size          = 256M
innodb_thread_concurrency       = 0
innodb_file_format              = Barracuda
innodb_flush_method             = O_DIRECT

innodb_lock_wait_timeout        = 60

innodb_read_io_threads          = 64
innodb_write_io_threads         = 32

innodb_ft_enable_stopword       = 0
innodb_ft_min_token_size        = 2

innodb_flush_log_at_trx_commit  = 0

innodb_open_files               = 4096

# NUMA improvement
innodb_buffer_pool_populate     = 1

innodb_file_per_table

#
# * Security
#
ssl-ca    = /etc/ssl/certs/CA.crt
ssl-cert  = /etc/mysql/keys/db1a.crt
ssl-key   = /etc/mysql/keys/db1a.key

Best Answer

I think you are not alone with this issue. We stopped using multi-master replication, moved to beefed-up RAID plus XtraBackup instead, and have had no issues with 5.5 or 5.6.

I have read that once you start using wsrep and clustering, however, these issues creep in. I'm not sure there is a definitive fix yet, but your best bet may be to follow this thread:

I hope this helps; that article likely has the best "straight from Percona engineers" commands for diagnosing your issue.

To summarize the Percona engineers' tips: check the error logs, of course, and run:

show status like 'wsrep%';
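That query returns dozens of rows; a narrowed-down sketch like the following (using the standard Galera status counters, which should all exist on your 25.11 wsrep build) focuses on the flow-control and queue metrics that usually explain a cluster-wide stall:

```sql
-- Hedged sketch: restrict the wsrep status output to the counters that
-- usually explain cluster-wide stalls.
SHOW STATUS WHERE Variable_name IN (
    'wsrep_local_state_comment',  -- should read 'Synced' on a healthy node
    'wsrep_flow_control_paused',  -- fraction of time replication was paused; near 1.0 = stalled
    'wsrep_local_recv_queue',     -- writesets waiting to be applied on this node
    'wsrep_local_send_queue',     -- writesets waiting to be sent to the cluster
    'wsrep_cert_deps_distance'    -- rough parallelism available to applier threads
);
```

If `wsrep_local_recv_queue` grows while `wsrep_flow_control_paused` climbs toward 1.0 during an incident, the node is applying writesets too slowly (applier threads, disk), which would also fit the gcache page creation you see in the log right before the hang.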

The article also mentions some useful tools. One thing we have in our Ansible playbooks for every server running database software is raising the maximum open-file limit (ulimit -n), etc.
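For reference, a minimal sketch of how we raise that limit for the mysql user — this assumes PAM-based limits as on a sysvinit Debian/RHEL box; under systemd you would set `LimitNOFILE` in the mysqld service unit instead. The value matches the `open_files_limit = 65536` already in your my.cnf:

```ini
# /etc/security/limits.conf -- raise the open-file limit for the mysql user
# (assumption: PAM-based limits; with systemd, use LimitNOFILE=65536 in the
# mysqld unit file instead)
mysql  soft  nofile  65536
mysql  hard  nofile  65536
```

You can confirm what the running daemon actually inherited with `grep 'open files' /proc/<mysqld-pid>/limits`.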

I'm not sure whether it will help, but Peter from Percona recommends explicitly capping per-user connections when granting privileges. I haven't tried it myself, but it might also be useful for narrowing down whether the problem is a runaway client or the cluster itself.

mysql> GRANT USAGE ON *.* TO 'batchjob1'@'localhost'
    ->     WITH MAX_USER_CONNECTIONS 10;
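Before adding caps like that, it can help to see which account is actually eating the 500-connection budget during a stall. A hedged sketch against `information_schema.PROCESSLIST` (available in 5.6; the column names are standard, the grouping choice is mine):

```sql
-- Count open connections per account to spot which client piles up
-- while the node is stalled.
SELECT USER,
       SUBSTRING_INDEX(HOST, ':', 1) AS client_host,  -- strip the client port
       COUNT(*) AS connections,
       SUM(COMMAND = 'Sleep') AS sleeping             -- idle vs. active threads
FROM information_schema.PROCESSLIST
GROUP BY USER, client_host
ORDER BY connections DESC;
```

A large count of `Sleep` threads for one user would point at an application pool that keeps opening connections while its queries wait on the lock, rather than at the cluster itself.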