We have several 90TB servers (Areca RAID-6 partitioned into 10 ext4 partitions).
The application reads a specialized hardware device and writes data to disk at a constant 8 to 10 MBytes/second. (the app runs separate reader and writer threads with a memory buffer between)
Once in a while file operations slow to a crawl, eg, fwrite()
taking 10-30 seconds to write 200 bytes of data, ftell()
taking a similar time to get the file position, etc.
Sometimes (but not always), when the system slows down top
shows all processes with 0 %CPU, similar to this serverfault posting.
The slowdown will last a few minutes and then the machine will be fine. Later it will happen again.
There is no "daily frequency" of the slowdown so I've eliminated something running via cron
/ anacron
etc.
However, during a slowdown the system load goes from 0.4 (normal) to 3-4. iotop
often shows kworker/u16:0+flush-8:112
with very high I/O %. top
, vmstat
etc show very high %wa
, so some I/O bottleneck somewhere.
I've changed the I/O scheduler to mq-deadline
.
Also, the application runs fsync
every 5 seconds on the file system containing the open file being written.
I've been googling for days on what might be causing this slowdown to find what parameters could be adjusted but no luck so far.
Any suggestions on how to tune Linux for this constant-write scenario?
Centos 8
# uname -a
Linux 4.18.0-147.el8.x86_64 #1 SMP Wed Dec 4 21:51:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
IOTOP OUTPUT
Total DISK READ : 161.32 K/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
b' 10666 rt/0 root 161.32 K/s 0.00 B/s 0.00 % 4.43 % MyWritingApp'
Total DISK READ : 239.58 K/s | Total DISK WRITE : 78.27 M/s
Actual DISK READ: 239.58 K/s | Actual DISK WRITE: 2.20 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
b' 431412 be/4 root 135.01 K/s 0.00 B/s 0.00 % 44.80 % [kworker/u16:8-flush-8:32]'
b' 10666 rt/0 root 104.57 K/s 78.11 M/s 0.00 % 1.88 % MyWritingApp'
b' 8853 be/3 root 0.00 B/s 9.27 K/s 0.00 % 0.01 % [jbd2/sdh1-8]'
b' 1123 ?dif mysql 0.00 B/s 149.57 K/s 0.00 % 0.00 % mysqld --basedir=/usr'
Total DISK READ : 388.68 K/s | Total DISK WRITE : 14.98 M/s
Actual DISK READ: 469.32 K/s | Actual DISK WRITE: 1899.42 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
b' 338764 be/4 root 388.68 K/s 0.00 B/s 0.00 % 89.25 % [kworker/u16:0+flush-8:112]'
b' 1123 ?dif mysql 0.00 B/s 955.83 K/s 0.00 % 0.00 % mysqld --basedir=/usr'
b' 2740 be/4 apache 0.00 B/s 1353.76 B/s 0.00 % 0.00 % httpd -DFOREGROUND'
b' 10664 be/4 root 0.00 B/s 1353.76 B/s 0.00 % 0.00 % php /usr/local/dvstor/bin64/DvStorStartRecording.php mode=ASI tsNum=0'
b' 10666 rt/0 root 0.00 B/s 14.04 M/s 0.00 % 0.00 % MyWritingApp'
Total DISK READ : 609.63 K/s | Total DISK WRITE : 15.87 K/s
Actual DISK READ: 609.63 K/s | Actual DISK WRITE: 31.74 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
b' 338764 be/4 root 601.70 K/s 0.00 B/s 0.00 % 99.43 % [kworker/u16:0+flush-8:112]'
b' 711 be/4 root 7.93 K/s 0.00 B/s 0.00 % 3.00 % systemd-journald'
b' 9246 be/3 root 0.00 B/s 7.93 K/s 0.00 % 0.00 % [jbd2/sdi1-8]'
b' 933 be/3 root 0.00 B/s 1354.15 B/s 0.00 % 0.00 % auditd'
b' 1966 be/4 root 0.00 B/s 2.64 K/s 0.00 % 0.00 % rsyslogd -n'
b' 2740 be/4 apache 0.00 B/s 1354.15 B/s 0.00 % 0.00 % httpd -DFOREGROUND'
VMSTAT OUTPUT
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 2 194048 334164 612708 12537168 0 1 14 1114 13 217 1 2 95 1 0
0 2 194048 329368 614000 12540308 0 0 646 0 5136 6152 1 1 78 19 0
0 2 194048 324640 615288 12543276 0 0 622 96 5249 6541 1 2 79 18 0
0 2 194048 321584 616624 12546188 0 0 664 512 5660 6815 2 2 81 15 0
0 2 194048 317788 617884 12549124 0 0 626 52 6079 7983 2 3 80 15 0
15930768 K total memory
2445452 K used memory
2673404 K active memory
11678768 K inactive memory
318308 K free memory
617888 K buffer memory
12549120 K swap cache
8212476 K total swap
194048 K used swap
8018428 K free swap
I/O Scheduler
# for d in {b..k}; do echo -n "sd${d} " ; cat /sys/block/sd${d}/queue/scheduler; done
sdb [mq-deadline] kyber bfq none
sdc [mq-deadline] kyber bfq none
sdd [mq-deadline] kyber bfq none
sde [mq-deadline] kyber bfq none
sdf [mq-deadline] kyber bfq none
sdg [mq-deadline] kyber bfq none
sdh [mq-deadline] kyber bfq none
sdi [mq-deadline] kyber bfq none
sdj [mq-deadline] kyber bfq none
sdk [mq-deadline] kyber bfq none
Best Answer
After no responses from the community, here's what I think has solved the problem:
Disabling the journal on the filesystem has seemingly eliminated the incidences of high I/O load and
%wa
on all partitions. (yes, I know that introduces other risks, but having the machine freeze for I/O is never acceptable in our application)I disabled the journal using
tune2fs
:Note Although the filesystem will still be ext4 after removing the journal,
parted
andfile
will now incorrectly report it as an ext2 volume.blkid
will correctly report the volume asext4
.Vojtech Trefny suggests this is a bug in
libparted
according in Inconsistent filesystem type reported after disabling journalling.EDIT I
Disabling journalling did not solve the problem for all machines.