MariaDB 5.5 InnoDB keeps opening files up to 200k until system hangs

centoslinuxmariadb

MariaDB 5.5 is installed on CentOS and serves a database in the total volume of around 40GBs (as measured per database files on a disk) and 506 tables. This database is being queried by php-fpm, and the issue is that at some point, when traffic increases, the number of open files opened by mysqld process (as per lsof states) grows up to around 200k and system hangs (specifically web requests ar being extremely slow, 5 mins for TTFB).

The server itself is should have enough capacity (Supermicro; X10DRH with 125gb of RAM and SSD) to be able to perform quickly on such a modest load.

Examination of lsof output shows that mysqld process keeps open tables on and where:

[root@mail proc]# lsof | wc -l
95592

Most of them are open tables:

[root@mail proc]# lsof | grep .ibd | wc -l
57331

Before that, lsof showed very big qty of /[aio] and putting innodb_use_native_aio=0 improved the situation a bit

mysql_slow_log does not show queries that are running longer than 5s,

| Innodb_mem_total | 33061601280

My MySQL config is:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

symbolic-links=0
innodb_file_per_table = 1
innodb_buffer_pool_size = 1G
innodb_io_capacity = 2000
innodb_read_io_threads = 64
innodb_thread_concurrency = 0
innodb_write_io_threads = 64
innodb_use_native_aio=0
skip-name-resolve=1
max_heap_table_size=256M
tmp_table_size=256M 
slow-query-log=1
long_query_time=1

For some tables mysql.log shows some errors on startup:

191205 7:21:25 InnoDB: Error: trying to open a table, but could not
InnoDB: open the tablespace file './staging/yotpo_rich_snippets.ibd'!
but I'm not sure is that might be a reason for low performance.

Also, database data files are located in /home/ partition, which is:

/dev/md2 /home ext4 grpquota,usrquota,data=ordered,relatime,rw 0 2

ulimit -a output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515264
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 515264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


[root@mail public_html]# iostat -xm 5 3
Linux 3.14.32-xxxx-grs-ipv6-64 (mail.hostname.com)  01/27/2020  _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.09    0.00    0.29    0.03    0.00   93.59

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.31    51.87   13.18  223.86     0.84     2.38    27.83     0.13    0.54    3.46    0.37   0.09   2.04
sdb               0.30    51.86   12.08  223.86     0.78     2.38    27.48     0.13    0.53    3.70    0.36   0.09   2.06
md2               0.00     0.00    1.34  262.88     0.09     2.23    17.97     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.08    6.54     0.00     0.14    43.22     0.00    0.00    0.00    0.00   0.00   0.00




avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.29    0.00    0.47    0.01    0.00   94.23

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    33.00    0.00   37.60     0.00     0.43    23.62     0.00    0.09    0.00    0.09   0.09   0.32
sdb               0.00    33.00    0.00   37.60     0.00     0.43    23.62     0.00    0.09    0.00    0.09   0.09   0.32
md2               0.00     0.00    0.00   42.20     0.00     0.28    13.38     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00   20.20     0.00     0.14    13.70     0.00    0.00    0.00    0.00   0.00   0.00




avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.81    0.00    0.60    0.01    0.00   95.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    43.80    0.40   58.20     0.01     0.82    28.92     0.01    0.14    0.00    0.14   0.12   0.72
sdb               0.00    43.80    0.00   58.20     0.00     0.82    28.68     0.01    0.10    0.00    0.10   0.08   0.48
md2               0.00     0.00    0.00   69.20     0.00     0.62    18.22     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.40   21.20     0.01     0.17    17.33     0.00    0.00    0.00    0.00   0.00   0.00

Any hint on where to dig would be highly appreciated.

Global variables:

https://pastebin.com/jsjyyExz

Global status:

https://pastebin.com/ArQvBRYm

mysqltuner report:

https://pastebin.com/JmyBkZCh

htop shows that mysqld process eats up to around 80% cpu on peak load time
and show processlist shows some specific queries which have a lot (thousands) of PKs in WHERE IN ( ... ) statements. Unfortunately, there is no easy way to get rid of those statements.

Best Answer

Several possibilities:

  • Some queries are taking a long time (and hanging onto files (that is, tables) meanwhile). -- Look for slow queries; let's work on speeding them up. You have the slowlog turned on; use pt-query-digest to identify the 'worst' queries.
  • The APP uses the "EAV" model, which joins lots of tables. -- If this is the case, let's discuss that.
  • The web server allows a huge number of connections. If it is allowing "too many", then will just stumble over each other. "5mins for TTFB" is a clue of this. -- Throttle connections at the web server layer.
  • Not effectively using RAM. You have 125GB of RAM, yet innodb_buffer_pool_size is only 1G. -- Change that to 80G.