Linux – SysBench highlighting abysmal disk write performance on white box vs Tier1 server

io linux performance

I have been testing a white box system with a SuperMicro X8DTL motherboard and various SATA hard disks, such as the 7,200 RPM Seagate Constellation ES and the 10,000 RPM Western Digital VelociRaptor.

I have tested Fedora 15, Fedora 16, and Ubuntu 11, running SysBench with the command found on the MySQL High Performance blog.

The disk write results are abysmal. Typical results look like this:

Operations performed:  0 Read, 3635 Write, 3635 Other = 7270 Total
    Read 0b  Written 14.199Mb  Total transferred 14.199Mb  (145.39Kb/sec)
    36.35 Requests/sec executed

On the other hand, a Tier 1 server with exactly the same CPU and 7,200 RPM SATA hard disks, running Fedora 15, produces these test results:

Operations performed: 0 Read, 151453 Write, 151453 Other = 302906 Total
    Read 0b  Written 591.61Mb  Total transferred 591.61Mb  (5.9159Mb/sec)
    1514.48 Requests/sec executed

I cannot understand how there can be such a massive difference, or why the SuperMicro-based system produces such awful disk write performance.

I have tried various things, including tweaks to fstab, scheduling, and disabling disk standby, and I have used sar, iostat, vmstat and so on to look for potential problems. But %idle, %iowait and so on don't show anything unusual. I also checked vm.zone_reclaim_mode, as suggested in "Poor disk performance", but the default setting on Fedora is already 0.

I tried different BIOS settings for the hard disk controller, including IDE and AHCI. I would expect AHCI to be the best, but the difference in write performance between the IDE and AHCI options is negligible.

Does anyone have ideas?

Best Answer

First, to do general IO testing, I recommend using iozone: http://www.iozone.org/

To properly answer this question, we need more information about the IO subsystems in each server.

However, in general, if you're looking for good IO performance, you need a dedicated hardware RAID card with onboard cache and a battery to back up that cache. This allows the RAID card to perform write-back caching, which can dramatically improve IO performance. The RAID card may also provide better throughput in general compared to onboard controllers.
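To see why caching is the likely difference, a rough rule of thumb helps: without a write cache acknowledging requests early, a spinning disk can complete at most about one synchronous write per platter revolution. The following is a minimal sketch of that arithmetic using the numbers from the question; the function names `revs_per_sec` and `rate_ratio` are made up for illustration, and the one-write-per-revolution bound is an approximation, not a measurement.

```shell
# Rough upper bound on uncached synchronous writes per second for a
# spinning disk: one write per platter revolution (RPM / 60).
revs_per_sec() {
    awk -v rpm="$1" 'BEGIN { printf "%d\n", rpm / 60 }'
}

# Ratio between two request rates, rounded to one decimal place.
rate_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

revs_per_sec 7200          # prints 120
rate_ratio 1514.48 36.35   # prints 41.7
```

At 36.35 requests/sec the white box is well below 120 per second, which is plausible for uncached synchronous writes with seeks in between. At 1514.48 requests/sec the Tier 1 server is far above what a 7,200 RPM disk could do on its own, which suggests that a write cache (on a controller or on the disk) is acknowledging the writes.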

And finally, an AHCI setting in the BIOS would control an onboard SATA controller. Onboard means it's on the motherboard and is not a server-class standalone hardware RAID card. If IO is not a priority for the workload, a server (whitebox or otherwise) may not have a separate RAID card and may indeed use the onboard controller.

You most assuredly want this BIOS setting set to AHCI, as any other setting will not provide Linux fast, direct access to the drives. If this setting is not making any difference, you might not have any drives connected to the onboard controller, or there may be another misconfiguration which is causing Linux or the BIOS to fall back to IDE-compatibility mode. You can check the kernel boot messages to see what drives the kernel sees and what interface the kernel is using to access those drives.
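As a minimal sketch of that check, the helper below filters kernel messages for the two drivers you would most likely see on an Intel onboard controller: `ahci` in AHCI mode and `ata_piix` in IDE-compatibility mode. The function name `show_sata_driver_lines` is hypothetical, and the driver names are an assumption about this particular board; other chipsets use other drivers.

```shell
# Print kernel log lines that mention the AHCI driver or the legacy
# Intel IDE-mode driver (matching is case-insensitive).
# $1 (optional): a saved log file; without it, the live kernel ring
# buffer is read via dmesg (which may require root on some systems).
show_sata_driver_lines() {
    if [ -n "$1" ]; then
        grep -E -i 'ahci|ata_piix' "$1"
    else
        dmesg | grep -E -i 'ahci|ata_piix'
    fi
}

# Usage:
#   show_sata_driver_lines                 # live kernel messages
#   show_sata_driver_lines /var/log/dmesg  # saved boot log, if present
```

If only `ata_piix` lines show up even though the BIOS is set to AHCI, the controller is probably still running in compatibility mode.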

Mitigation (for "older" kernels)

The negative effect can be mitigated by increasing the amount of queued requests in the IO scheduler queue like this:

echo 4096 | sudo tee /sys/block/sdc/queue/nr_requests

In my case this nearly triples the throughput (to ~56MB/s) for the 4GB random data test explained in my question. Of course, the performance still falls roughly 100MB/s short of unencrypted IO.
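Before changing the value it is worth recording the current one, so that it can be restored later. A minimal sketch, assuming the same device name `sdc` as above; the helper name `show_nr_requests` is made up for illustration:

```shell
# Print the current request queue depth of a block device.
# $1: device name such as sdc
# $2 (optional): sysfs block directory, defaults to /sys/block
show_nr_requests() {
    cat "${2:-/sys/block}/$1/queue/nr_requests"
}

# Usage:
#   show_nr_requests sdc                                   # note the old value
#   echo 4096 | sudo tee /sys/block/sdc/queue/nr_requests  # raise it
#   show_nr_requests sdc                                   # confirm the change
```

Values written to sysfs do not survive a reboot, so the setting has to be reapplied at boot if it is meant to be permanent.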

Investigation

Multicore `blktrace`

I further investigated the problematic scenario in which a btrfs filesystem is placed on top of a LUKS-encrypted block device. To see which write instructions are issued to the actual block device, I used blktrace like this:

sudo blktrace -a write -d /dev/sdc -o - | blkparse -b 1 -i - | grep -w D

What this does (as far as I was able to comprehend) is trace IO requests to /dev/sdc that are of type "write", parse them into human-readable output, and then restrict the output to action "D", which is (according to man blkparse) "IO issued to driver".

The result looked something like this (about 5000 lines of output are in the multicore log):

8,32   0    32732   127.148240056     3  D   W 38036976 + 240 [ksoftirqd/0]
8,32   0    32734   127.149958221     3  D   W 38038176 + 240 [ksoftirqd/0]
8,32   0    32736   127.160257521     3  D   W 38038416 + 240 [ksoftirqd/0]
8,32   1    30264   127.186905632    13  D   W 35712032 + 240 [ksoftirqd/1]
8,32   1    30266   127.196561599    13  D   W 35712272 + 240 [ksoftirqd/1]
8,32   1    30268   127.209431760    13  D   W 35713872 + 240 [ksoftirqd/1]

Column 1: major,minor of the block device
Column 2: CPU ID
Column 3: sequence number
Column 4: time stamp
Column 5: process ID
Column 6: action
Column 7: RWBS data (type, here W for write), followed by the start sector, the length after the + sign, and the process name in brackets

This is a snippet of the output produced while dd'ing the 4GB of random data onto the mounted filesystem. It is clear that at least two processes are involved, and the remaining log shows that all four processors are actually working on it. Sadly, the write requests are no longer ordered: while CPU0 is writing somewhere around sector 38038416, CPU1, which is scheduled afterwards, is writing somewhere around sector 35713872. That's bad.
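One way to put a number on this is to count how often a dispatched write starts at a lower sector than the one dispatched before it. The following is a minimal sketch and not part of the original investigation; the function name `count_backward_seeks` is made up for illustration, and the field positions assume the blkparse output format shown in the trace above.

```shell
# Count dispatched writes ("D" action, RWBS containing "W") whose start
# sector is lower than that of the previously dispatched write.
# Reads blkparse output from the given files, or from stdin if none.
count_backward_seeks() {
    awk '$6 == "D" && $7 ~ /W/ {
             sector = $8 + 0              # force numeric comparison
             if (seen && sector < prev) back++
             prev = sector; seen = 1
         }
         END { print back + 0 }' "$@"
}

# The six lines from the multicore trace contain one backward jump.
count_backward_seeks <<'EOF'
8,32   0    32732   127.148240056     3  D   W 38036976 + 240 [ksoftirqd/0]
8,32   0    32734   127.149958221     3  D   W 38038176 + 240 [ksoftirqd/0]
8,32   0    32736   127.160257521     3  D   W 38038416 + 240 [ksoftirqd/0]
8,32   1    30264   127.186905632    13  D   W 35712032 + 240 [ksoftirqd/1]
8,32   1    30266   127.196561599    13  D   W 35712272 + 240 [ksoftirqd/1]
8,32   1    30268   127.209431760    13  D   W 35713872 + 240 [ksoftirqd/1]
EOF
```

The same function can be appended to the live pipeline in place of `grep -w D`. A count that grows quickly relative to the number of requests is a rough indication that the device is being forced to seek back and forth.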

Singlecore `blktrace`

I did the same experiment after disabling multi-threading and disabling the second core of my CPU. Of course, only one processor is then involved in writing to the stick. More importantly, the write requests are properly sequential, which is why the full write performance of ~170MB/s is achieved in an otherwise identical setup.
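The text does not say how the core was disabled (it may have been a BIOS setting). As an assumption, one way to reproduce the experiment without rebooting is the kernel's CPU hotplug interface in sysfs. The helper name `cpu_online_state` below is hypothetical.

```shell
# Report whether a CPU is online (1) or offline (0) according to sysfs.
# $1: CPU number
# $2 (optional): sysfs CPU directory, defaults to /sys/devices/system/cpu
cpu_online_state() {
    d="${2:-/sys/devices/system/cpu}/cpu$1"
    [ -d "$d" ] || { echo "no such CPU: $1" >&2; return 1; }
    # CPUs that cannot be taken offline (often cpu0) have no "online" file.
    if [ -r "$d/online" ]; then cat "$d/online"; else echo 1; fi
}

# Usage:
#   echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online   # take CPU 1 offline
#   cpu_online_state 1                                       # should now print 0
#   echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online   # bring it back
```

Each logical CPU has to be taken offline separately, so on a CPU with multi-threading this means offlining every sibling thread except one.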

Have a look at about 5000 lines of output in the singlecore log.

Discussion

Now that I know the cause and the proper Google search terms, the information about this problem is bubbling up to the surface. As it turns out, I am not the first one to notice.

Four years ago, a patch brought multi-threaded dm-crypt to the kernel. That commit pretty much matches my findings exactly.
Two years ago, patches improving dm-crypt performance were discussed, including re-ordering of write requests.
One year ago, the topic was still being discussed.
Recently, a patch enabling sorting for dm-crypt was finally committed to the kernel.
There is an interesting email with performance tests (which I did not read very much of) concerning this phenomenon.

Fixed in current kernels (>=4.0.2)

Because I (later) found the kernel commit obviously targeted at this exact problem, I wanted to try an updated kernel. (I compiled it myself, only to find out afterwards that it is already in debian/sid.) It turns out that the problem is indeed fixed. I don't know the exact kernel release in which the fix appeared, but the original commit will give clues to anyone interested.

For the record:

$ uname -a
Linux t440p 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test bs=1M conv=fsync
4294967296 bytes (4.3 GB) copied, 29.7559 s, 144 MB/s

A hat tip to Mikulas Patocka, who authored the commit.

Related Solutions

Improving General dm-crypt (LUKS) Write Performance
