Improving General dm-crypt (LUKS) Write Performance

btrfs, cryptsetup, dm-crypt, luks, performance

I am investigating a problem where encrypting a block device imposes a huge performance penalty when writing to it. Hours of Internet reading and experiments did not provide me with a proper understanding, let alone a solution.

The question in short: Why do I get perfectly fast write speeds when putting a btrfs onto a block device (~170MB/s), while the write speed plummets (~20MB/s) when putting a dm-crypt/LUKS in between the file system and the block device, although the system is more than capable of sustaining a sufficiently high encryption throughput?

Scenario

/home/schlimmchen/Documents/random is a 4.0GB file that was filled with data from /dev/urandom earlier:

dd if=/dev/urandom of=/home/schlimmchen/Documents/random bs=1M count=4096

Reading it is super fast:

$ dd if=/home/schlimmchen/Documents/random of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 6.58036 s, 648 MB/s
$ dd if=/home/schlimmchen/Documents/random of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 0.786102 s, 5.4 GB/s

(the second time, the file was obviously read from cache).

Unencrypted btrfs

The device is directly formatted with btrfs (no partition table on the block device).

$ sudo mkfs.btrfs /dev/sdf
$ sudo mount /dev/sdf /mnt
$ sudo chmod 777 /mnt

Write speed gets as high as ~170MB/s:

$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test1 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 27.1564 s, 157 MB/s
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test2 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 25.1882 s, 169 MB/s
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test3 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 29.8419 s, 143 MB/s

Read speed is well above 200MB/s:

$ dd if=/mnt/dd-test1 of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 19.8265 s, 215 MB/s
$ dd if=/mnt/dd-test2 of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 19.9821 s, 213 MB/s
$ dd if=/mnt/dd-test3 of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 19.8561 s, 215 MB/s

Encrypted btrfs on block device

The device is formatted with LUKS, and the resultant device is formatted with btrfs:

$ sudo cryptsetup luksFormat /dev/sdf
$ sudo cryptsetup luksOpen /dev/sdf crypt
$ sudo mkfs.btrfs /dev/mapper/crypt
$ sudo mount /dev/mapper/crypt /mnt
$ sudo chmod 777 /mnt
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test1 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 210.42 s, 20.3 MB/s
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test2 bs=1M 
4265841146 bytes (4.3 GB) copied, 207.402 s, 20.6 MB/s

Read speed suffers only marginally (why does it at all?):

$ dd if=/mnt/dd-test1 of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 22.2002 s, 192 MB/s
$ dd if=/mnt/dd-test2 of=/dev/null bs=1M
4265841146 bytes (4.3 GB) copied, 22.0794 s, 193 MB/s

luksDump: http://pastebin.com/i9VYRR0p

Encrypted btrfs in file on btrfs on block device

The write speed "skyrockets" to over 150MB/s when writing into an encrypted file. I put a btrfs onto the block device, allocated a 16GB file, which I lukfsFormat'ed and mounted.

$ sudo mkfs.btrfs /dev/sdf -f
$ sudo mount /dev/sdf /mnt
$ sudo chmod 777 /mnt
$ dd if=/dev/zero of=/mnt/crypted-file bs=1M count=16384 conv=fsync
17179869184 bytes (17 GB) copied, 100.534 s, 171 MB/s
$ sudo cryptsetup luksFormat /mnt/crypted-file
$ sudo cryptsetup luksOpen /mnt/crypted-file crypt
$ sudo mkfs.btrfs /dev/mapper/crypt
$ sudo mount /dev/mapper/crypt /tmp/nested/
$ dd if=/home/schlimmchen/Documents/random of=/tmp/nested/dd-test1 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 26.4524 s, 161 MB/s
$ dd if=/home/schlimmchen/Documents/random of=/tmp/nested/dd-test2 bs=1M conv=fsync
4265841146 bytes (4.3 GB) copied, 27.5601 s, 155 MB/s

Why is the write performance increasing like this? What does this particular nesting of filesystems and block devices achieve to aid in high write speeds?

Setup

The problem is reproducible on two systems running the same distro and kernel. However, I also observed the low write speeds with kernel 3.19.0 on System2.

  • Device: SanDisk Extreme 64GB USB3.0 USB Stick
  • System1: Intel NUC 5i5RYH, i5-5250U (Broadwell), 8GB RAM, Samsung 840 EVO 250GB SSD
  • System2: Lenovo T440p, i5-4300M (Haswell), 16GB RAM, Samsung 850 PRO 256GB SSD
  • Distro/Kernel: Debian Jessie, 3.16.7
  • cryptsetup: 1.6.6
  • /proc/crypto for System1: http://pastebin.com/QUSGMfiS
  • cryptsetup benchmark for System1: http://pastebin.com/4RxzPFeT
  • btrfs(-tools) is version 3.17
  • lsblk -t /dev/sdf: http://pastebin.com/nv49tYWc

Thoughts

  • Alignment is not the cause as far as I can see. Even if the stick's page size is 16KiB, the cryptsetup payload start is aligned to 2MiB anyway.
  • --allow-discards (for cryptsetup's luksOpen) did not help, as I was expecting.
  • While doing a lot less experiments with it, I observed very similar behavior with an external hard drive, connected through a USB3.0 adapter.
  • It seems that the system is writing in 64KiB blocks. A systemtap script I tried indicates as much, and /sys/block/sdf/stat backs this hypothesis up, since a lot of writes are merged. So my guess is that writing in too-small blocks is not the cause.
  • No luck with changing the block device queue scheduler to NOOP.
  • Putting the crypt into an LVM volume did not help.
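
For reference, the alignment and scheduler checks above boil down to commands like these (a sketch using /dev/sdf as in the question; the offset value and available scheduler names may differ on your system):

```shell
# The LUKS payload offset is reported in 512-byte sectors,
# so 4096 sectors = 2MiB:
sudo cryptsetup luksDump /dev/sdf | grep -i offset

# The active IO scheduler is shown in brackets; switch to noop:
cat /sys/block/sdf/queue/scheduler
echo noop | sudo tee /sys/block/sdf/queue/scheduler
```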

Best Answer

The answer (as I now know): concurrency.

In short: My sequential write, either using dd or when copying a file (like... in daily use), becomes a pseudo-random write (bad) because four threads are working concurrently on writing the encrypted data to the block device after concurrent encryption (good).
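
The concurrency is easy to see in the process table while a write is in flight: dm-crypt hands the encrypted writes to per-CPU kcryptd kernel workers (a sketch; exact thread names and counts vary with kernel version):

```shell
# List the dm-crypt worker threads and the CPU each one last ran on;
# on affected kernels multiple kcryptd workers compete to submit
# writes, which destroys the ordering.
ps -eLo pid,psr,comm | grep kcryptd
```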

Mitigation (for "older" kernels)

The negative effect can be mitigated by increasing the amount of queued requests in the IO scheduler queue like this:

echo 4096 | sudo tee /sys/block/sdc/queue/nr_requests

In my case this nearly triples the throughput (~56MB/s) for the 4GB random-data test described in my question. Of course, performance still falls about 100MB/s short of unencrypted IO.
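
Note that the sysfs write does not survive a reboot; a udev rule along these lines should reapply it (the rule file name and device match are my assumptions, so check them against your distro's udev setup):

```shell
# /etc/udev/rules.d/60-nr-requests.rules (file name is an assumption):
# ACTION=="add|change", KERNEL=="sdc", ATTR{queue/nr_requests}="4096"

# Apply without replugging the device:
sudo udevadm control --reload
sudo udevadm trigger --action=change --name-match=/dev/sdc
```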

Investigation

Multicore blktrace

I further investigated the problematic scenario in which a btrfs is placed on top of a LUKS-encrypted block device. To show me what write instructions are issued to the actual block device, I used blktrace like this:

sudo blktrace -a write -d /dev/sdc -o - | blkparse -b 1 -i - | grep -w D

What this does (as far as I was able to comprehend) is trace IO requests to /dev/sdc that are of type "write", parse them into human-readable output, and restrict that output to action "D", which (according to man blkparse) means "IO issued to driver".

The result was something like this (see about 5000 lines of output of the multicore log):

8,32   0    32732   127.148240056     3  D   W 38036976 + 240 [ksoftirqd/0]
8,32   0    32734   127.149958221     3  D   W 38038176 + 240 [ksoftirqd/0]
8,32   0    32736   127.160257521     3  D   W 38038416 + 240 [ksoftirqd/0]
8,32   1    30264   127.186905632    13  D   W 35712032 + 240 [ksoftirqd/1]
8,32   1    30266   127.196561599    13  D   W 35712272 + 240 [ksoftirqd/1]
8,32   1    30268   127.209431760    13  D   W 35713872 + 240 [ksoftirqd/1]

  • Column 1: major,minor of the block device
  • Column 2: CPU ID
  • Column 3: sequence number
  • Column 4: time stamp
  • Column 5: process ID
  • Column 6: action
  • Column 7: RWBS data (type, sector, length)

This is a snippet of the output produced while dd'ing the 4GB of random data onto the mounted filesystem. It is clear that at least two processes are involved; the remaining log shows that all four processors are actually working on it. Sadly, the write requests are no longer ordered. While CPU0 is writing somewhere around the 38038416th sector, CPU1, which is scheduled afterwards, is writing somewhere around the 35713872nd sector. That's bad.
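
To quantify the reordering rather than eyeball it, a small awk sketch over the blkparse output works (the column numbers match the layout above; "multicore.log" stands for wherever you saved the trace):

```shell
# Count how often the issued sector (column 8) jumps backwards
# relative to the previous write request ("D"/"W" lines only).
awk '$6 == "D" && $7 == "W" {
    if (seen && $8 < prev) backwards++
    prev = $8; seen = 1; total++
}
END { printf "%d of %d writes went backwards\n", backwards, total }' multicore.log
```

On the six sample lines above, this reports one backwards jump (the hand-off from CPU0 to CPU1).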

Singlecore blktrace

I did the same experiment after disabling multi-threading and disabling the second core of my CPU. Of course, only one processor is involved in writing to the stick. But more importantly, the write requests are properly sequential, which is why the full write performance of ~170MB/s is achieved in the otherwise same setup.

Have a look at about 5000 lines of output in the singlecore log.
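
For anyone reproducing the singlecore run: a logical CPU can be taken offline at runtime (a sketch; I disabled multi-threading in the firmware, and cpu1 being the right logical CPU to disable is an assumption that depends on your topology):

```shell
# Take the second logical CPU offline (requires root):
echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online

# ...repeat the dd write test...

# Bring it back online afterwards:
echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online
```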

Discussion

Now that I know the cause and the proper Google search terms, information about this problem is bubbling up to the surface. As it turns out, I am not the first one to notice.

Fixed in current kernels (>=4.0.2)

Because I (later) found a kernel commit obviously targeting this exact problem, I wanted to try an updated kernel. (After compiling it myself, I found out it is already in debian/sid.) It turns out that the problem is indeed fixed. I don't know the exact kernel release in which the fix appeared, but the original commit will give clues to anyone interested.

For the record:

$ uname -a
Linux t440p 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test bs=1M conv=fsync
4294967296 bytes (4.3 GB) copied, 29.7559 s, 144 MB/s

A hat tip to Mikulas Patocka, who authored the commit.
