Is there any way to make the system more consistent when using LUKS (or slow storage in general)? As it is, everything is snappy until the write buffer is full, then everything grinds to a halt as the kernel blocks writes. I have the same issue on my laptop with a slow SSD: it's fine, then I have to wait 30 seconds while it flushes, and meanwhile I can do nearly nothing. Is there a way to tune the disk cache system? Alternatively, can I keep things from stalling completely, so that only the write that's actually blocked has to wait?
Linux – slow media – disk cache tuning
encryption linux performance storage
Related Solutions
The symptoms are very consistent with a mostly saturated IO system. However, having for the most part ruled out IO load from the OS/userspace side, another possibility is that the drive is running a self-test, which may include reading every sector. This should be queryable and tunable with smartctl (smartctl -c is at least one place to query it).
As for why it's coming and going and started suddenly now:
- The drive has passed a certain stage in its life (number of sectors written, time spun up, etc.) and the drive's firmware has triggered one of these scans.
- I believe this can also be triggered via smartctl, so it's possible some automated process triggered it.
- One of these scans was triggered and flagged as started or in progress, and once the drive has been powered on for a certain amount of time it is re-triggered, either from the beginning or resuming where it left off.
The answer (as I now know): concurrency.
In short: My sequential write, either using dd or when copying a file (like... in daily use), becomes a pseudo-random write (bad) because four threads are working concurrently on writing the encrypted data to the block device after concurrent encryption (good).
Mitigation (for "older" kernels)
The negative effect can be mitigated by increasing the number of queued requests in the IO scheduler queue like this:
echo 4096 | sudo tee /sys/block/sdc/queue/nr_requests
In my case this nearly triples the throughput (to ~56MB/s) for the 4GB random data test explained in my question. Of course, the performance still falls 100MB/s short of unencrypted IO.
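To make it easier to compare before and after, the same change can be wrapped in a small function that reports the old and new value. This is only a sketch: the function name and the SYSFS_ROOT override are hypothetical helpers, not part of any tool, and the setting is not persistent across reboots. The kernel may also refuse values it considers too large for a given device or scheduler.

```shell
#!/bin/sh
# Sketch: raise nr_requests for one block device and report old/new values.
# SYSFS_ROOT is a hypothetical override so the function can be tried on a
# scratch directory; on a real system it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_nr_requests() {
    dev="$1"      # kernel device name without /dev/, e.g. sdc
    new="$2"      # requested queue depth, e.g. 4096
    knob="$SYSFS_ROOT/block/$dev/queue/nr_requests"

    # Bail out early if the file is missing or we lack write permission.
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (wrong device name, or not root?)" >&2
        return 1
    fi

    old=$(cat "$knob")
    echo "$new" > "$knob" || return 1
    # Read the value back, since the kernel decides what is actually stored.
    echo "$dev: nr_requests $old -> $(cat "$knob")"
}
```

Run from a root shell, the call would be set_nr_requests sdc 4096, where sdc is the device from the example above.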
Investigation
Multicore blktrace
I further investigated the problematic scenario in which a btrfs is placed on top of a LUKS encrypted block device. To show what write instructions are issued to the actual block device, I used blktrace like this:
sudo blktrace -a write -d /dev/sdc -o - | blkparse -b 1 -i - | grep -w D
What this does is (as far as I was able to comprehend) trace IO requests to /dev/sdc which are of type "write", then parse this into human readable output but further restrict the output to action "D", which is (according to man blkparse) "IO issued to driver".
The result was something like this (see about 5000 lines of output of the multicore log):
8,32 0 32732 127.148240056 3 D W 38036976 + 240 [ksoftirqd/0]
8,32 0 32734 127.149958221 3 D W 38038176 + 240 [ksoftirqd/0]
8,32 0 32736 127.160257521 3 D W 38038416 + 240 [ksoftirqd/0]
8,32 1 30264 127.186905632 13 D W 35712032 + 240 [ksoftirqd/1]
8,32 1 30266 127.196561599 13 D W 35712272 + 240 [ksoftirqd/1]
8,32 1 30268 127.209431760 13 D W 35713872 + 240 [ksoftirqd/1]
- Column 1: major,minor of the block device
- Column 2: CPU ID
- Column 3: sequence number
- Column 4: time stamp
- Column 5: process ID
- Column 6: action
- Column 7: RWBS data (type, sector, length)
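Judging "sequential or not" by eye across thousands of lines is tedious, so the check can be automated. The following is a minimal sketch, assuming the column layout listed above (action in field 6, RWBS in field 7, start sector in field 8, length in field 10); count_sequential is a hypothetical helper name. It counts how many issued writes start exactly where the previous one ended.

```shell
#!/bin/sh
# Sketch: summarise how sequential a stream of issued writes is.
# Reads blkparse lines (in the format shown above) on stdin.
count_sequential() {
    awk '
        # Only look at requests issued to the driver that are writes.
        $6 == "D" && $7 ~ /W/ {
            sector = $8; len = $10
            if (seen) {
                # Contiguous if this request starts where the last one ended.
                if (sector == expected) seq++; else jumps++
            }
            expected = sector + len
            seen = 1
            total++
        }
        END { printf "total=%d sequential=%d jumps=%d\n", total, seq, jumps }
    '
}
```

Appending "| count_sequential" to the blktrace/blkparse pipeline prints the summary once tracing is stopped. For the six sample lines above it reports 2 sequential writes and 3 jumps.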
This is a snippet of the output produced while dd'ing the 4GB random data onto the mounted filesystem. It is clear that at least two processes are involved. The remaining log shows that all four processors are actually working on it. Sadly, the write requests are not ordered anymore. While CPU0 is writing somewhere around the 38038416th sector, CPU1, which is scheduled afterwards, is writing somewhere around the 35713872nd sector. That's bad.
Singlecore blktrace
I did the same experiment after disabling multi-threading and disabling the second core of my CPU. Of course, only one processor is involved in writing to the stick. But more importantly, the write requests are properly sequential, which is why the full write performance of ~170MB/s is achieved in the otherwise same setup.
Have a look at about 5000 lines of output in the singlecore log.
Discussion
Now that I know the cause and the proper google search terms, the information about this problem is bubbling up to the surface. As it turns out, I am not the first one to notice.
- Four years ago, a patch brought multi-threaded dm-crypt to the kernel. That commit pretty much matches my findings exactly.
- Two years ago, patches were discussed improving dm-crypt performance, including re-ordering of write requests.
- One year ago, the topic was still discussed.
- Recently, a patch enabling sorting for dm-crypt was finally committed to the kernel.
- There is an interesting email with performance tests (which I did not read very much of) concerning this phenomenon.
Fixed in current kernels (>=4.0.2)
Because I (later) found the kernel commit obviously targeted at this exact problem, I wanted to try an updated kernel. [After compiling it myself and then finding out it's already in debian/sid] It turns out that the problem is indeed fixed. I don't know the exact kernel release in which the fix appeared, but the original commit will give clues to anyone interested.
For the record:
$ uname -a
Linux t440p 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test bs=1M conv=fsync
4294967296 bytes (4.3 GB) copied, 29.7559 s, 144 MB/s
A hat tip to Mikulas Patocka, who authored the commit.
Best Answer
There are a couple of options. You can use ionice to set priorities for certain things. You can also try a different elevator; deadline would probably make more sense in your case:
http://www.redhat.com/magazine/008jun05/features/schedulers/
http://wlug.org.nz/LinuxIoScheduler
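Before switching elevators it helps to see which one is active. The kernel exposes the list in /sys/block/<device>/queue/scheduler and marks the active entry with square brackets. A minimal sketch for reading it follows; active_scheduler is a hypothetical helper name, and sda in the note below is only an example device.

```shell
#!/bin/sh
# Sketch: print the active elevator from a queue/scheduler file.
# The file lists all available schedulers and brackets the active one,
# e.g. "noop deadline [cfq]".
active_scheduler() {
    # $1: path to a .../queue/scheduler file
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}
```

For example, active_scheduler /sys/block/sda/queue/scheduler shows the current choice, and echo deadline | sudo tee /sys/block/sda/queue/scheduler switches it until the next reboot. On newer multi-queue kernels the equivalent is likely named mq-deadline. For ionice, something like sudo ionice -c 3 -p <pid> moves a running process into the idle class, though as far as I know IO priorities are only honoured by some schedulers (CFQ, for instance).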