Configuring two PCIe NVMe SSDs as a Linux software RAID1 has roughly halved the read speed instead of boosting read performance.
In similar Linux software RAID1 setups (also SSDs) I have seen an increase in read performance, since two mirrored block devices can then be used for the reads.
What could be potential reasons and lines of investigation to address this performance issue?
Benchmarking was done using fio with 4k random reads on /dev/md125 (the RAID1) and on its members /dev/nvme1n1 and /dev/nvme0n1. Reading from the members directly is faster than reading from /dev/md125.
It seems other people using Linux software RAID1 also face a counter-intuitive speed reduction instead of a speed gain for RAID1 reads (see https://serverfault.com/questions/235199/poor-software-raid10-read-performance-on-linux).
Here are some numbers from the performance benchmarks using fio with random 4k reads. Running fio concurrently on the /dev/nvme1n1p1 and /dev/nvme0n1p1 devices I get this:
fio4k /dev/nvme1n1p1
[...]
read: IOPS=637k, BW=2487MiB/s (2608MB/s)(146GiB/60001msec)
fio4k /dev/nvme0n1p1
read: IOPS=652k, BW=2545MiB/s (2669MB/s)(149GiB/60001msec)
If I create a RAID1 /dev/md125 from both (/dev/nvme1n1p1 and /dev/nvme0n1p1, even skipping the bitmap so as not to cause any negative impact):
mdadm --verbose --create /dev/md/raid1_nvmes --bitmap=none --assume-clean --level=1 --raid-devices=2 /dev/nvme0n1p1 /dev/nvme1n1p1
fio4k /dev/md125
[...]
read: IOPS=337k, BW=1317MiB/s (1381MB/s)(77.2GiB/60001msec)
Update: fio command line and other info
This is the fio command used, with the variables BLOCKDEVICE and BLOCKSIZE set according to the values above: BLOCKSIZE=4k and BLOCKDEVICE being /dev/nvme0n1p1, /dev/nvme1n1p1 or /dev/md/raid1_nvmes:
fio --filename="$BLOCKDEVICE" \
--direct=1 \
--rw=randread \
--readonly \
--bs="$BLOCKSIZE" \
--ioengine=libaio \
--iodepth=256 \
--runtime=60 \
--numjobs=4 \
--time_based \
--group_reporting \
--name=iops-test-job \
--direct=1 \
--eta-newline=1 2>&1
This is the output of the fio tests I ran:
Test: fio benchmark directly on block device /dev/nvme0n1p1
root@ada:/virtualization/machines# cat /usr/local/bin/nn_scripts/nn_fio
#!/bin/bash
set -x
BLOCKDEVICE="$1"
test -b "$BLOCKDEVICE" || { echo "usage: $0 <blockdev> [size_of_io_chunk] [mode: randread]" >&2; exit 1; }
BLOCKSIZE="$2"
test "${BLOCKSIZE%%[kMGT]}" -eq "${BLOCKSIZE%%[kMGT]}" 2>/dev/null || { echo "Run FIO benchmark with block size of 4k"; BLOCKSIZE=4k; }
fio --filename="$BLOCKDEVICE" \
--direct=1 \
--rw=randread \
--readonly \
--bs="$BLOCKSIZE" \
--ioengine=libaio \
--iodepth=256 \
--runtime=60 \
--numjobs=4 \
--time_based \
--group_reporting \
--name=iops-test-job \
--direct=1 \
--eta-newline=1 2>&1 | tee /root/fio.logs/fio.$(basename "$BLOCKDEVICE:").$BLOCKSIZE.$(date -Iseconds)
root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio /dev/nvme0n1p1
+ BLOCKDEVICE=/dev/nvme0n1p1
+ test -b /dev/nvme0n1p1
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/nvme0n1p1 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/nvme0n1p1:
++ date -Iseconds
+ tee /root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00
tee: '/root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes
iops-test-job: (groupid=0, jobs=4): err= 0: pid=28221: Fri Feb 26 11:42:04 2021
read: IOPS=626k, BW=2446MiB/s (2565MB/s)(143GiB/60001msec)
slat (usec): min=2, max=625, avg= 4.59, stdev= 3.06
clat (usec): min=90, max=10696, avg=1629.07, stdev=128.82
lat (usec): min=96, max=10700, avg=1633.79, stdev=129.08
clat percentiles (usec):
| 1.00th=[ 1401], 5.00th=[ 1434], 10.00th=[ 1450], 20.00th=[ 1516],
| 30.00th=[ 1582], 40.00th=[ 1614], 50.00th=[ 1647], 60.00th=[ 1663],
| 70.00th=[ 1696], 80.00th=[ 1729], 90.00th=[ 1762], 95.00th=[ 1811],
| 99.00th=[ 1909], 99.50th=[ 1975], 99.90th=[ 2245], 99.95th=[ 2606],
| 99.99th=[ 3458]
bw ( KiB/s): min=479040, max=691888, per=25.00%, avg=626199.33, stdev=37403.47, samples=477
iops : min=119760, max=172972, avg=156549.78, stdev=9350.91, samples=477
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=99.63%, 4=0.36%, 10=0.01%, 20=0.01%
cpu : usr=30.55%, sys=69.28%, ctx=38473, majf=0, minf=6433
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=37573862,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=2446MiB/s (2565MB/s), 2446MiB/s-2446MiB/s (2565MB/s-2565MB/s), io=143GiB (154GB), run=60001-60001msec
Disk stats (read/write):
nvme0n1: ios=37487591/1001, merge=14/185, ticks=15999825/331, in_queue=24175124, util=100.00%
real 1m0.698s
user 1m20.593s
sys 2m46.774s
Test: fio benchmark on the RAID1 /dev/md127 (/dev/md/ada:raid1_nvmes)
root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio "$(realpath "/dev/md/ada:raid1_nvmes")"
+ BLOCKDEVICE=/dev/md127
+ test -b /dev/md127
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/md127 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/md127:
++ date -Iseconds
+ tee /root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00
tee: '/root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes
iops-test-job: (groupid=0, jobs=4): err= 0: pid=67832: Fri Feb 26 11:50:07 2021
read: IOPS=322k, BW=1257MiB/s (1318MB/s)(73.6GiB/60001msec)
slat (usec): min=3, max=535, avg=10.44, stdev= 5.29
clat (usec): min=47, max=14172, avg=3170.20, stdev=142.99
lat (usec): min=59, max=14179, avg=3180.78, stdev=143.44
clat percentiles (usec):
| 1.00th=[ 2900], 5.00th=[ 2966], 10.00th=[ 2999], 20.00th=[ 3032],
| 30.00th=[ 3097], 40.00th=[ 3163], 50.00th=[ 3195], 60.00th=[ 3228],
| 70.00th=[ 3261], 80.00th=[ 3294], 90.00th=[ 3326], 95.00th=[ 3359],
| 99.00th=[ 3425], 99.50th=[ 3458], 99.90th=[ 3621], 99.95th=[ 3818],
| 99.99th=[ 5866]
bw ( KiB/s): min=293472, max=350408, per=24.99%, avg=321583.77, stdev=11302.31, samples=477
iops : min=73368, max=87602, avg=80395.91, stdev=2825.56, samples=477
lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=99.96%, 10=0.03%, 20=0.01%
cpu : usr=18.54%, sys=81.47%, ctx=342, majf=0, minf=11008
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=19303258,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=1257MiB/s (1318MB/s), 1257MiB/s-1257MiB/s (1318MB/s-1318MB/s), io=73.6GiB (79.1GB), run=60001-60001msec
The Linux kernel version is:
root@ada:/virtualization/machines# uname -a
Linux ada 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
The scheduler used on the NVMe devices is none:
root@ada:/virtualization/machines# grep . /sys/block/{md127,nvme0n1,nvme1n1}/queue/scheduler
/sys/block/md127/queue/scheduler:none
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline
/sys/block/nvme1n1/queue/scheduler:[none] mq-deadline
There was a request to provide iostat output for a) direct NVMe SSD performance and b) performance of the RAID1 of the NVMe SSDs.
a) direct nvme performance
tps kB_read/s kB_wrtn/s kB_read kB_wrtn Device
543201.33 2.1G 1.5M 6.2G 4.6M nvme1n1
20.67 1.3k 1.5M 4.0k 4.6M nvme0n1
25.67 1.3k 1.5M 4.0k 4.6M md127
b) performance of the raid1
tps kB_read/s kB_wrtn/s kB_read kB_wrtn Device
169797.33 663.3M 32.3k 1.9G 97.0k nvme1n1
159573.67 623.3M 32.3k 1.8G 97.0k nvme0n1
329367.33 1.3G 32.0k 3.8G 96.0k md127
c) performance of a parallel fio benchmark on /dev/nvme1n1p1 and /dev/nvme0n1p1
tps kB_read/s kB_wrtn/s kB_read kB_wrtn Device
585589.67 2.2G 20.7M 6.7G 62.0M nvme1n1
405723.00 1.5G 20.7M 4.6G 62.0M nvme0n1
421.67 1.1M 20.7M 3.4M 62.0M md127
The two involved NVMe devices are Samsung 970 EVO Plus:
root@ada:/sys/module# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S4EWNM0NC28151E Samsung SSD 970 EVO Plus 1TB 1 284.89 GB / 1.00 TB 512 B + 0 B 2B2QEXM7
/dev/nvme1n1 S4EWNM0NC28144V Samsung SSD 970 EVO Plus 1TB 1 284.89 GB / 1.00 TB 512 B + 0 B 2B2QEXM7
They are inserted into PCIe slots in the system using this adapter. The output of lspci is hence:
root@ada:/sys/module# lspci -vv | grep -i 'nvme ssd controller'
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
62:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
The system is a DELL server with 512 GiB RAM and two sockets equipped with AMD EPYC 7551 32-core processors.
During the benchmarks there were no dmesg errors.
Best Answer
(For those posting questions involving fio, I strongly recommend that you clearly post the full job you are running and the fio version number, because these things can have a huge impact on whether you get a correct answer to your question.)
fio is reporting more kernel overhead in the mdadm case; also check out the difference in the number of job context switches. You may want to look into letting fio do batching -- https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth-batch-submit -- so each call is allowed to submit more in one go (see the sketch below). Additionally, you may want to use /dev/md/raid1_nvmes as the RAID device name if fio is failing to give disk stats for it with your previous command line.
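A sketch of that batching idea, based on the command line from the question (the batch sizes of 64 are assumed starting points, not tested values):
# same job as in the question, plus batched submission/completion (batch values are assumptions)
fio --filename=/dev/md127 \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs=4k \
    --ioengine=libaio \
    --iodepth=256 \
    --iodepth_batch_submit=64 \
    --iodepth_batch_complete_max=64 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-batched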
Another thing to check is the speeds you get when you read from both the underlying disks at the same time. An example job (a sketch reusing the device paths and options from the question) is something like this:
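; sketch of a job file: options mirror the question's command line, device paths are the ones above
; [solo] runs first on its own; the stonewall makes the two duo jobs start together afterwards
[global]
bs=4k
direct=1
rw=randread
readonly
ioengine=libaio
iodepth=256
runtime=60
time_based
group_reporting

[solo]
filename=/dev/nvme0n1p1

[duo_nvme0]
stonewall
filename=/dev/nvme0n1p1

[duo_nvme1]
filename=/dev/nvme1n1p1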
Hopefully the solo job runs by itself and the duo jobs run simultaneously, but my fio job format may be a bit rusty, so feel free to play about with it or split the duo run off into a separate fio invocation.
Dead end ideas
The concept of mdadm RAID chunk size sadly won't have any bearing on this particular problem. Unlike RAID 0/4/5/6/10, mdadm's RAID 1 doesn't have chunks (see this answer to "mdadm raid1 and what chunksize (or blocksize) on 4k drives?" or search for --chunk in the mdadm man page).
If the I/O you are doing is a single sequential stream, it is not expected that mdadm RAID1 reads should be any faster than those of a single disk. As above, this shouldn't apply in this case because a) the reads are random and b) multiple parallel readers (via numjobs in the fio case) were taking place.
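For contrast, a single sequential stream of the kind that would not be expected to benefit from RAID1 would be something like the following sketch (block size and target device are illustrative, not one of the benchmarks above):
# one reader with one outstanding sequential I/O: nothing for RAID1 to spread across both mirrors
fio --filename=/dev/md127 \
    --direct=1 \
    --rw=read \
    --readonly \
    --bs=128k \
    --ioengine=libaio \
    --iodepth=1 \
    --numjobs=1 \
    --runtime=60 \
    --time_based \
    --name=single-sequential-stream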