Linux software raid1: What reasons might cause a read speed reduction

mdadm, performance, software-raid

Configuring two PCIe NVMe SSDs as a Linux software raid1, instead of boosting read performance, has roughly halved the read speed.

In similar Linux software raid1 setups (also with SSDs) I have seen an increase in read performance, since reads can be served from both mirrored block devices.

What could be potential reasons and lines of investigation to address this performance issue?

Benchmarking was done using fio with 4k random reads on /dev/md125 (the raid1) and on its members /dev/nvme1n1 and /dev/nvme0n1. Reading from the members directly is faster than reading from /dev/md125.

It seems other people using Linux software raid1 also face a counter-intuitive speed reduction instead of a speed gain for raid1 reads (see https://serverfault.com/questions/235199/poor-software-raid10-read-performance-on-linux).

Here are some numbers from the performance benchmarks using fio with random 4k reads. On the /dev/nvme1n1p1 and /dev/nvme0n1p1 devices I get this:

 fio4k /dev/nvme1n1p1
 [...]
 read: IOPS=637k, BW=2487MiB/s (2608MB/s)(146GiB/60001msec)
 
 fio4k /dev/nvme0n1p1
 read: IOPS=652k, BW=2545MiB/s (2669MB/s)(149GiB/60001msec)

If I create a raid1 /dev/md125 from both of them (/dev/nvme1n1p1 and /dev/nvme0n1p1, even skipping the bitmap so as not to cause any negative impact):

  mdadm --verbose  --create /dev/md/raid1_nvmes --bitmap=none --assume-clean --level=1 --raid-devices=2 /dev/nvme0n1p1 /dev/nvme1n1p1
  fio4k /dev/md125
  [...]
  read: IOPS=337k, BW=1317MiB/s (1381MB/s)(77.2GiB/60001msec)

Update: fio command line and other info

This is the fio command used (with the variables BLOCKDEVICE and BLOCKSIZE set according to the values provided above: BLOCKSIZE=4k and BLOCKDEVICE being /dev/nvme0n1p1, /dev/nvme1n1p1 or /dev/md/raid1_nvmes):

fio --filename="$BLOCKDEVICE" \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs="$BLOCKSIZE" \
    --ioengine=libaio \
    --iodepth=256 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-job \
    --direct=1 \
    --eta-newline=1 2>&1

This is the output of the fio tests I ran:

test fio benchmark directly on the block device /dev/nvme0n1p1

root@ada:/virtualization/machines# cat /usr/local/bin/nn_scripts/nn_fio
#!/bin/bash

set -x
BLOCKDEVICE="$1"
test -b "$BLOCKDEVICE" || { echo "usage: $0 <blockdev> [size_of_io_chunk] [mode: randread]" >&2; exit 1; }

BLOCKSIZE="$2"
test "${BLOCKSIZE%%[kMGT]}" -eq "${BLOCKSIZE%%[kMGT]}" 2>/dev/null || { echo "Run FIO benchmark with block size of 4k";  BLOCKSIZE=4k; }


fio --filename="$BLOCKDEVICE" \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs="$BLOCKSIZE" \
    --ioengine=libaio \
    --iodepth=256 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-job \
    --direct=1 \
    --eta-newline=1 2>&1 | tee /root/fio.logs/fio.$(basename "$BLOCKDEVICE:").$BLOCKSIZE.$(date -Iseconds)

root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio /dev/nvme0n1p1
+ BLOCKDEVICE=/dev/nvme0n1p1
+ test -b /dev/nvme0n1p1
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/nvme0n1p1 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --grou
p_reporting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/nvme0n1p1:
++ date -Iseconds
+ tee /root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00
tee: '/root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes

iops-test-job: (groupid=0, jobs=4): err= 0: pid=28221: Fri Feb 26 11:42:04 2021
  read: IOPS=626k, BW=2446MiB/s (2565MB/s)(143GiB/60001msec)
    slat (usec): min=2, max=625, avg= 4.59, stdev= 3.06
    clat (usec): min=90, max=10696, avg=1629.07, stdev=128.82
     lat (usec): min=96, max=10700, avg=1633.79, stdev=129.08
    clat percentiles (usec):
     |  1.00th=[ 1401],  5.00th=[ 1434], 10.00th=[ 1450], 20.00th=[ 1516],
     | 30.00th=[ 1582], 40.00th=[ 1614], 50.00th=[ 1647], 60.00th=[ 1663],
     | 70.00th=[ 1696], 80.00th=[ 1729], 90.00th=[ 1762], 95.00th=[ 1811],
     | 99.00th=[ 1909], 99.50th=[ 1975], 99.90th=[ 2245], 99.95th=[ 2606],
     | 99.99th=[ 3458]
   bw (  KiB/s): min=479040, max=691888, per=25.00%, avg=626199.33, stdev=37403.47, samples=477
   iops        : min=119760, max=172972, avg=156549.78, stdev=9350.91, samples=477
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.63%, 4=0.36%, 10=0.01%, 20=0.01%
  cpu          : usr=30.55%, sys=69.28%, ctx=38473, majf=0, minf=6433
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=37573862,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=2446MiB/s (2565MB/s), 2446MiB/s-2446MiB/s (2565MB/s-2565MB/s), io=143GiB (154GB), run=60001-60001msec

Disk stats (read/write):
  nvme0n1: ios=37487591/1001, merge=14/185, ticks=15999825/331, in_queue=24175124, util=100.00%

real    1m0.698s
user    1m20.593s
sys     2m46.774s

test fio benchmark on the raid1 device /dev/md127 (/dev/md/ada:raid1_nvmes)

root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio "$(realpath "/dev/md/ada:raid1_nvmes")"
+ BLOCKDEVICE=/dev/md127
+ test -b /dev/md127
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/md127 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_re
porting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/md127:
++ date -Iseconds
+ tee /root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00
tee: '/root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes

iops-test-job: (groupid=0, jobs=4): err= 0: pid=67832: Fri Feb 26 11:50:07 2021
  read: IOPS=322k, BW=1257MiB/s (1318MB/s)(73.6GiB/60001msec)
    slat (usec): min=3, max=535, avg=10.44, stdev= 5.29
    clat (usec): min=47, max=14172, avg=3170.20, stdev=142.99
     lat (usec): min=59, max=14179, avg=3180.78, stdev=143.44
    clat percentiles (usec):
     |  1.00th=[ 2900],  5.00th=[ 2966], 10.00th=[ 2999], 20.00th=[ 3032],
     | 30.00th=[ 3097], 40.00th=[ 3163], 50.00th=[ 3195], 60.00th=[ 3228],
     | 70.00th=[ 3261], 80.00th=[ 3294], 90.00th=[ 3326], 95.00th=[ 3359],
     | 99.00th=[ 3425], 99.50th=[ 3458], 99.90th=[ 3621], 99.95th=[ 3818],
     | 99.99th=[ 5866]
   bw (  KiB/s): min=293472, max=350408, per=24.99%, avg=321583.77, stdev=11302.31, samples=477
   iops        : min=73368, max=87602, avg=80395.91, stdev=2825.56, samples=477
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=99.96%, 10=0.03%, 20=0.01%
  cpu          : usr=18.54%, sys=81.47%, ctx=342, majf=0, minf=11008
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=19303258,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=1257MiB/s (1318MB/s), 1257MiB/s-1257MiB/s (1318MB/s-1318MB/s), io=73.6GiB (79.1GB), run=60001-60001msec

The Linux kernel version is:

root@ada:/virtualization/machines# uname -a
Linux ada 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux

The scheduler used on the NVMe devices is none:

root@ada:/virtualization/machines# grep . /sys/block/{md127,nvme0n1,nvme1n1}/queue/scheduler
/sys/block/md127/queue/scheduler:none
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline
/sys/block/nvme1n1/queue/scheduler:[none] mq-deadline

There was a request to provide iostat output for the cases a) direct NVMe SSD performance and b) performance of the raid1 of the NVMe SSDs.

a) direct nvme performance

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 543201.33         2.1G         1.5M       6.2G       4.6M nvme1n1
    20.67         1.3k         1.5M       4.0k       4.6M nvme0n1
    25.67         1.3k         1.5M       4.0k       4.6M md127

b) performance of the raid1

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 169797.33       663.3M        32.3k       1.9G      97.0k nvme1n1
 159573.67       623.3M        32.3k       1.8G      97.0k nvme0n1
 329367.33         1.3G        32.0k       3.8G      96.0k md127

c) performance of parallel fio benchmark of /dev/nvme1n1p1 and /dev/nvme0n1p1

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 585589.67         2.2G        20.7M       6.7G      62.0M nvme1n1
 405723.00         1.5G        20.7M       4.6G      62.0M nvme0n1
   421.67         1.1M        20.7M       3.4M      62.0M md127

The two NVMe devices involved are Samsung 970 EVO Plus drives:

root@ada:/sys/module# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S4EWNM0NC28151E      Samsung SSD 970 EVO Plus 1TB             1         284.89  GB /   1.00  TB    512   B +  0 B   2B2QEXM7
/dev/nvme1n1     S4EWNM0NC28144V      Samsung SSD 970 EVO Plus 1TB             1         284.89  GB /   1.00  TB    512   B +  0 B   2B2QEXM7

They are inserted into PCIe slots in the system using an adapter card, so the output of lspci is:

root@ada:/sys/module# lspci -vv | grep -i 'nvme ssd controller'
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
62:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

The system is a Dell server with 512 GiB of RAM and two sockets equipped with AMD EPYC 7551 32-core processors.

During the benchmarks there were no dmesg errors.

Best Answer

(For those posting questions involving fio, I strongly recommend clearly posting the full job you are running and the fio version number, because these things can have a huge impact on whether you get a correct answer to your question.)

fio is reporting more kernel overhead in the mdadm case; also note the difference in the number of context switches between the jobs. You may want to look into letting fio do batching -- https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth-batch-submit -- so each call is allowed to submit more I/Os in one go. Additionally, you may want to use /dev/md/raid1_nvmes as the RAID device name if fio is failing to give disk stats for it with your previous command line.
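As a rough sketch (untested; the batch values are illustrative starting points rather than tuned recommendations), your original random-read job could be extended with batching options like this:

fio --filename=/dev/md/raid1_nvmes \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs=4k \
    --ioengine=libaio \
    --iodepth=256 \
    --iodepth_batch_submit=16 \
    --iodepth_batch_complete_max=16 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-job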

Another thing to check is the speeds you get when you read from both the underlying disks at the same time. An example job is something like this:

fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
  --iodepth=1024 --runtime=60 --time_based \
  --name=solo1 --filename=/dev/nvme0n1p1 --stonewall \
  --name=solo2 --filename=/dev/nvme1n1p1 --stonewall \
  --name=duo1 --filename=/dev/nvme0n1p1 --name=duo2 --filename=/dev/nvme1n1p1

Hopefully the solo jobs run by themselves and the duo jobs run simultaneously, but my fio job format may be a bit rusty, so feel free to play about with it or split the duo run off into separate fio invocations (a sketch of that follows below).
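If you do split it up, a rough sketch (untested) is to run the solo jobs one after the other and then start two fio processes in parallel for the duo case, e.g. from a shell:

# solo runs: each device benchmarked on its own
fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
    --iodepth=1024 --runtime=60 --time_based \
    --name=solo1 --filename=/dev/nvme0n1p1
fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
    --iodepth=1024 --runtime=60 --time_based \
    --name=solo2 --filename=/dev/nvme1n1p1

# duo run: two fio processes reading both devices at the same time
fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
    --iodepth=1024 --runtime=60 --time_based \
    --name=duo1 --filename=/dev/nvme0n1p1 &
fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
    --iodepth=1024 --runtime=60 --time_based \
    --name=duo2 --filename=/dev/nvme1n1p1 &
wait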

Dead end ideas

The concept of mdadm RAID chunk size sadly won't have any bearing on this particular problem. Unlike RAID 0/4/5/6/10, mdadm's RAID 1 doesn't have chunks (see this answer to "mdadm raid1 and what chunksize (or blocksize) on 4k drives?" or search for --chunk in the mdadm man page).
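As a quick sanity check (just a sketch; the exact output format depends on your mdadm version), you can confirm that no chunk size is reported for the RAID 1 array:

# RAID 0/4/5/6/10 arrays report a "Chunk Size" line here; RAID 1 typically does not
mdadm --detail /dev/md127 | grep -i chunk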

If the I/O you are doing is a single sequential stream, mdadm RAID 1 reads are not expected to be any faster than those of a single disk. As noted above, this shouldn't apply in this case because a) the reads are random and b) multiple parallel readers (via numjobs in the fio case) were taking place.
