I am running Linux on a spinning hard drive: a WDC WD5000LPLX-7 ("WD Black Mobile", 7200 RPM).
I noticed that a simple file copy (or write) causes fsync() latencies of over ten seconds. Is there some way to avoid this on Linux, without replacing the hardware or changing the cp command[*]? Or is there no other way to avoid it?
[*] I am able to avoid it if I write to the files using O_DIRECT instead.
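As an illustration of that workaround, dd can perform the copy with O_DIRECT writes via oflag=direct (a sketch with hypothetical file names, not a full replacement for cp; O_DIRECT needs aligned block sizes and is not supported on every filesystem):

```shell
# Copy a file using O_DIRECT for the output, bypassing the page cache.
# bs must be a multiple of the device's logical block size; 1M is safe here.
# (File names are hypothetical examples.)
dd if=writetest of=copytest bs=1M oflag=direct
```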
What is fsync()?
https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
rename() + fsync() is used to update a file atomically, to be safe in case of power failure.
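In shell terms the pattern looks like this (a minimal sketch with hypothetical file names; coreutils sync(1) with a file argument calls fsync() on just that file):

```shell
# Atomic, crash-safe file update: write a temporary file, fsync it,
# then rename() it over the old name. A crash leaves either the old
# contents or the new contents, never a half-written file.
printf 'new contents\n' > myfile.tmp
sync myfile.tmp       # sync(1) on a single file calls fsync()
mv myfile.tmp myfile  # rename() is atomic within a filesystem
```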
Application developers are advised to write configuration/state updates from a separate thread, so they do not freeze the user interface if the write takes a while (see example: freeze in gnome-shell). However, this advice is less useful when saving user files. For example, when you edit files one at a time using an editor in the terminal (vi my-file, edit, wq to finish), vi naturally waits for fsync() to finish before exiting. You might prefer to use a different editor, but I bet yours does the same thing :-).
Test setup
$ sudo -i
# lvcreate alan_dell_2016 -n test --extents 100%FREE
Logical volume "test" created.
# ls -l /dev/alan_dell_2016/test
lrwxrwxrwx. 1 root root 7 Feb 18 13:34 /dev/alan_dell_2016/test -> ../dm-3
$ uname -r
4.20.3-200.fc29.x86_64
$ cat /sys/block/sda/queue/scheduler
mq-deadline [bfq] none
$ cat /sys/block/dm-3/queue/scheduler
none
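For reference, the active scheduler can be switched at runtime through sysfs (a sketch assuming sda, needs root; the change is immediate but not persistent across reboots):

```shell
# The bracketed entry is the scheduler currently in use.
cat /sys/block/sda/queue/scheduler     # e.g. "mq-deadline [bfq] none"
# Switch the disk to mq-deadline:
echo mq-deadline > /sys/block/sda/queue/scheduler
```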
I have reproduced the gnome-shell freeze using the CFQ I/O scheduler. CFQ goes away in the next kernel release anyway, so for the moment I have been configuring my system to use BFQ. I have also tried the mq-deadline scheduler. With all of these I/O schedulers, I saw fsync() latencies longer than ten seconds. My kernel is built with CONFIG_BLK_WBT_MQ=y. (WBT applies to the deadline scheduler; it does not apply to bfq by default.)
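With CONFIG_BLK_WBT_MQ enabled, the writeback-throttling (WBT) target latency can be inspected and tuned per device through sysfs (a sketch assuming sda; the example value is illustrative, not a recommendation):

```shell
# Read the WBT target latency in microseconds; 0 means WBT is disabled
# for this device (e.g. when a scheduler such as bfq turns it off).
cat /sys/block/sda/queue/wbt_lat_usec
# Lower the target to throttle background writeback more aggressively
# (takes effect immediately, not persistent across reboots):
echo 25000 > /sys/block/sda/queue/wbt_lat_usec
</imports>
```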
# mkfs.ext4 /dev/alan_dell_2016/test
mke2fs 1.44.3 (10-July-2018)
Creating filesystem with 2982912 4k blocks and 746304 inodes
Filesystem UUID: 736bee3c-f0eb-49ee-b5be-de56ef1f38d4
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
# mount /dev/alan_dell_2016/test /mnt
# cd /mnt
# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/alan_dell_2016-test 12G 41M 11G 1% /mnt
Test run
# dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync & sleep 1; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 1; killall -0 dd || break; done
[1] 17060
real 1m14.972s
user 0m0.001s
sys 0m0.000s
real 1m14.978s
user 0m0.005s
sys 0m0.002s
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 75.9998 s, 70.6 MB/s
[1]+ Done dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync
dd: no process found
# cp writetest copytest & sleep 3; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 3; killall -0 cp || break; done
[1] 17397
real 0m59.479s
user 0m0.000s
sys 0m0.002s
[1]+ Done cp -i writetest copytest
real 0m59.504s
user 0m0.037s
sys 0m4.385s
cp: no process found
I suppose this involves filesystem details. If I do the same sort of thing at the block device level, the latency is much lower.
# cd / && umount /mnt
# dd if=/dev/zero of=/dev/alan_dell_2016/test bs=1M count=2000 conv=fsync &
[1] 6681
# dd if=/dev/zero of=/dev/alan_dell_2016/test oflag=sync bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.193815 s, 21.1 kB/s
Best Answer
These delays depend on the filesystem. I ran the test on different filesystems: ext3 and ext4 both showed some "bad" results; XFS, btrfs, and bcachefs were "not as bad". I did not gather rigorous statistics, but there are some illustrative numbers in the next section.

UPDATE: The results also depend on the kernel version. A few patches have just been accepted for ext4 which appear very relevant to this problem. Yay! I tried applying the patches on top of v5.2-rc5-293-g001da3fd278f. Running the tests, I did not see any delays of ten seconds or above, although some came close. It looked like it might sometimes be worse than my XFS/btrfs/bcachefs results.

DISCLAIMER: My testing has not been very scientific :-). In this update, I made sure I could still reproduce "fsync() latencies of over ten seconds" as a baseline. However, my baseline tests were on kernel 5.1.6-200.fc29, not v5.2-rc5. And I could only reproduce the long delays in the first test (writetest), not the second test (copytest).

Later, I tried testing Firefox at the same time as the copytest. Firefox was version 67.0.4 (with a couple of extensions). (The kernels were the same v5.2-rc5-293-g001da3fd278f with patches on top, vs. a baseline of 5.1.11-200.fc29.) My Firefox test was to open a new tab to google.com. On the baseline there were delays of tens of seconds, which looked like they might be correlated with similarly long fsync delays. And I think the patched kernel has fixed these ten-second-plus delays in Firefox. So I have great hopes for these patches!
That said, after allowing the copytest to reach around the 10G mark, the responsiveness of the entire GUI started to degrade, including the mouse cursor. I forced it to allow a VT switch using Alt+SysRq+R. After a few VT switches back and forth, the GUI crashed. Log messages from the crash are attached to the end of this answer. Possibly the GUI didn't like being forced like this. Or it might be that swapping-induced delays trigger some race condition in the Wayland compositor and/or XWayland, perhaps similar to the known bug which gnome-shell "tries to avoid" :-(.
Ted Ts'o has also been murmuring about more long-term changes to ext4 journalling. So perhaps, some time in the future, there will be yet more improvement in ext4!
(Old) results on different filesystems
These tests were performed using a kernel built from the bcachefs tree, version v4.20-297-g2252e4b79f8f (2019-02-14).
See also the disclaimers in the update above. I was not able to reproduce all of the ext4 results on v5.1.6. Either my test methods are not reliable, or there was a significant change when going up to v5.1.6.
The ext3 figures look better when copying files, but ext3 is not a good idea for latency in general (e.g. see the tytso link :-). ext2 lacks journalling; journalling is generally desirable for robustness, but ext journalling is what causes this latency.
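As an illustration of that trade-off, ext4's journal can be removed entirely, giving ext2-like crash behaviour (a sketch assuming the test LV from the question; do not do this on data you care about):

```shell
# Remove the journal from an *unmounted* ext4 filesystem, then re-check it.
# This trades away the crash robustness the journal provides.
tune2fs -O ^has_journal /dev/alan_dell_2016/test
e2fsck -f /dev/alan_dell_2016/test
```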
So the alternatives I would be most interested in are XFS, the experimental bcachefs, and btrfs. I expect XFS is the simplest to use, at least on mechanical hard drives. One prominent difference is that there is no tool to shrink XFS filesystems, only to grow them.
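Growing, the one direction XFS does support, can be done online with xfs_growfs (a sketch assuming the LV and mount point from the question; the size is illustrative):

```shell
# Enlarge the underlying device first, then grow the mounted filesystem:
lvextend --size +2G /dev/alan_dell_2016/test
xfs_growfs /mnt        # grows the XFS filesystem to fill the enlarged device
```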
GUI crash
Log messages corresponding to the "GUI crash" mentioned above.
(Soon after this crash the gdm greeter managed to crash as well. Although those messages were a little different.)