kworker Flush 99% I/O Hangs File Write on Machine

centos linux

We have several 90TB servers (Areca RAID-6 partitioned into 10 ext4 partitions).

The application reads from a specialized hardware device and writes data to disk at a constant 8 to 10 MBytes/second. (The app runs separate reader and writer threads with a memory buffer between them.)

Once in a while, file operations slow to a crawl, e.g., fwrite() taking 10-30 seconds to write 200 bytes of data, ftell() taking a similar time to get the file position, etc.
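
To confirm where the time goes, the writer process can be watched with strace during a stall; -T shows the time spent in each syscall (the PID below is a placeholder for the writer's actual PID):

# strace -T -tt -e trace=write,lseek -p <writer-pid>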

Sometimes (but not always), when the system slows down, top shows all processes with 0 %CPU, similar to this serverfault posting.

The slowdown will last a few minutes and then the machine will be fine. Later it will happen again.

There is no daily pattern to the slowdown, so I've ruled out anything running via cron / anacron, etc.

However, during a slowdown the system load goes from 0.4 (normal) to 3-4. iotop often shows kworker/u16:0+flush-8:112 with a very high I/O %, and top, vmstat, etc. show very high %wa, so there is an I/O bottleneck somewhere.
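
Since kworker/u16:0+flush-8:112 is the kernel writeback worker for one of the block devices (8:112 is /dev/sdh here), I've been watching how much dirty data is queued for writeback during a stall, e.g.:

# watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'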

I've changed the I/O scheduler to mq-deadline.
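
For reference, the scheduler can be switched at runtime per device through sysfs (sdb shown as an example; this does not persist across reboots):

# echo mq-deadline > /sys/block/sdb/queue/scheduler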

Also, the application runs fsync every 5 seconds on the file system containing the open file being written.
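
To see how that 5-second fsync lines up with the kernel's own writeback cadence and thresholds, the relevant sysctls can be read (just inspecting here, nothing has been changed yet):

# sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs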

I've been googling for days for what might be causing this slowdown and which parameters could be adjusted, but no luck so far.

Any suggestions on how to tune Linux for this constant-write scenario?

CentOS 8

# uname -a
Linux 4.18.0-147.el8.x86_64 #1 SMP Wed Dec 4 21:51:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

IOTOP OUTPUT

Total DISK READ :     161.32 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
  10666 rt/0 root      161.32 K/s    0.00 B/s  0.00 %  4.43 % MyWritingApp
Total DISK READ :     239.58 K/s | Total DISK WRITE :      78.27 M/s
Actual DISK READ:     239.58 K/s | Actual DISK WRITE:       2.20 M/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 431412 be/4 root      135.01 K/s    0.00 B/s  0.00 % 44.80 % [kworker/u16:8-flush-8:32]
  10666 rt/0 root      104.57 K/s   78.11 M/s  0.00 %  1.88 % MyWritingApp
   8853 be/3 root        0.00 B/s    9.27 K/s  0.00 %  0.01 % [jbd2/sdh1-8]
   1123 ?dif mysql       0.00 B/s  149.57 K/s  0.00 %  0.00 % mysqld --basedir=/usr
Total DISK READ :     388.68 K/s | Total DISK WRITE :      14.98 M/s
Actual DISK READ:     469.32 K/s | Actual DISK WRITE:    1899.42 K/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 338764 be/4 root      388.68 K/s    0.00 B/s  0.00 % 89.25 % [kworker/u16:0+flush-8:112]
   1123 ?dif mysql       0.00 B/s  955.83 K/s  0.00 %  0.00 % mysqld --basedir=/usr
   2740 be/4 apache      0.00 B/s 1353.76 B/s  0.00 %  0.00 % httpd -DFOREGROUND
  10664 be/4 root        0.00 B/s 1353.76 B/s  0.00 %  0.00 % php /usr/local/dvstor/bin64/DvStorStartRecording.php mode=ASI tsNum=0
  10666 rt/0 root        0.00 B/s   14.04 M/s  0.00 %  0.00 % MyWritingApp
Total DISK READ :     609.63 K/s | Total DISK WRITE :      15.87 K/s
Actual DISK READ:     609.63 K/s | Actual DISK WRITE:      31.74 K/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 338764 be/4 root      601.70 K/s    0.00 B/s  0.00 % 99.43 % [kworker/u16:0+flush-8:112]
    711 be/4 root        7.93 K/s    0.00 B/s  0.00 %  3.00 % systemd-journald
   9246 be/3 root        0.00 B/s    7.93 K/s  0.00 %  0.00 % [jbd2/sdi1-8]
    933 be/3 root        0.00 B/s 1354.15 B/s  0.00 %  0.00 % auditd
   1966 be/4 root        0.00 B/s    2.64 K/s  0.00 %  0.00 % rsyslogd -n
   2740 be/4 apache      0.00 B/s 1354.15 B/s  0.00 %  0.00 % httpd -DFOREGROUND

VMSTAT OUTPUT

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 0  2       194048       334164       612708     12537168    0    1    14  1114   13  217   1   2  95   1   0
 0  2       194048       329368       614000     12540308    0    0   646     0 5136 6152   1   1  78  19   0
 0  2       194048       324640       615288     12543276    0    0   622    96 5249 6541   1   2  79  18   0
 0  2       194048       321584       616624     12546188    0    0   664   512 5660 6815   2   2  81  15   0
 0  2       194048       317788       617884     12549124    0    0   626    52 6079 7983   2   3  80  15   0
     15930768 K total memory
      2445452 K used memory
      2673404 K active memory
     11678768 K inactive memory
       318308 K free memory
       617888 K buffer memory
     12549120 K swap cache
      8212476 K total swap
       194048 K used swap
      8018428 K free swap

I/O Scheduler

# for d in {b..k}; do  echo -n "sd${d} " ; cat /sys/block/sd${d}/queue/scheduler;  done
sdb [mq-deadline] kyber bfq none
sdc [mq-deadline] kyber bfq none
sdd [mq-deadline] kyber bfq none
sde [mq-deadline] kyber bfq none
sdf [mq-deadline] kyber bfq none
sdg [mq-deadline] kyber bfq none
sdh [mq-deadline] kyber bfq none
sdi [mq-deadline] kyber bfq none
sdj [mq-deadline] kyber bfq none
sdk [mq-deadline] kyber bfq none

Best Answer

Since there were no responses from the community, here's what I think has solved the problem:

Disabling the journal on the filesystem has seemingly eliminated the incidents of high I/O load and %wa on all partitions. (Yes, I know that introduces other risks, but having the machine freeze on I/O is never acceptable in our application.)

I disabled the journal using tune2fs:

tune2fs -O ^has_journal <device>
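
Afterwards, the feature list can be checked to confirm the journal is gone; has_journal should no longer appear (<device> is a placeholder):

# tune2fs -l <device> | grep 'Filesystem features'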

Note: Although the filesystem will still be ext4 after removing the journal, parted and file will now incorrectly report it as an ext2 volume. blkid will correctly report the volume as ext4.
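
The discrepancy can be seen by comparing the three tools (sdh/sdh1 used here only as example device names):

# parted /dev/sdh print
# file -s /dev/sdh1
# blkid /dev/sdh1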

Vojtech Trefny suggests in Inconsistent filesystem type reported after disabling journalling that this is a bug in libparted.

EDIT I

Disabling journalling did not solve the problem for all machines.