Linux – Simple file copy (or write) causes ten second+ latency on Linux filesystem

Tags: filesystems, latency, linux

I am running Linux on a spinning hard drive. WDC WD5000LPLX-7, "WD Black Mobile", 7200 RPM.

I noticed that a simple file copy (or write) causes fsync() latencies of over ten seconds. Is there some way to avoid this on Linux without replacing the hardware or changing the cp command[*], or is there no other way around it?

[*] I am able to avoid it if I write to the files using O_DIRECT instead.
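As a sketch of that workaround from the shell, dd can request O_DIRECT with oflag=direct, which bypasses the page cache so no large writeback backlog builds up for a later fsync() to wait on. Note O_DIRECT needs block-aligned I/O sizes, and some filesystems (e.g. tmpfs) reject it, hence the fallback branch:

```shell
# Sketch: write a file with O_DIRECT, bypassing the page cache.
# O_DIRECT requires block-aligned I/O and is not supported by every
# filesystem (tmpfs rejects it), hence the fallback message.
if dd if=/dev/zero of=directtest bs=1M count=4 oflag=direct conv=fsync 2>/dev/null; then
    echo "direct write ok"
else
    echo "O_DIRECT unsupported on this filesystem"
fi
rm -f directtest
echo "direct write test complete"
```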

What is fsync()?

https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/

fsync() followed by rename() is used to update a file atomically, so the contents survive a power failure.
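The pattern itself is simple. A minimal shell sketch (myconfig is a hypothetical file name; coreutils `sync FILE` performs an fsync() on that file, just like the test loop below does):

```shell
# Write the new contents to a temporary file in the same directory,
# flush it to disk, then rename over the original.  After a power
# failure the file holds either the old or the new contents in full,
# never a torn mixture of the two.
printf 'new contents\n' > myconfig.tmp
sync myconfig.tmp            # fsync(): wait until the data is on disk
mv myconfig.tmp myconfig     # rename() is atomic within one filesystem
```

A fully careful program would additionally fsync() the containing directory afterwards, so the rename itself is also durable.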

Application developers are advised to write configuration/state updates from a separate thread, so they do not freeze the user interface if the write takes a while. (See example: freeze in gnome-shell). However, this advice is less useful when saving user files. For example, when you edit files one at a time using an editor in the terminal – vi my-file, edit, :wq to finish – vi naturally waits for fsync() to complete before exiting. You might prefer to use a different editor, but I bet yours does the same thing :-).

Test setup

$ sudo -i
# lvcreate alan_dell_2016 -n test --extents 100%FREE
  Logical volume "test" created.
# ls -l /dev/alan_dell_2016/test
lrwxrwxrwx. 1 root root 7 Feb 18 13:34 /dev/alan_dell_2016/test -> ../dm-3

$ uname -r
4.20.3-200.fc29.x86_64

$ cat /sys/block/sda/queue/scheduler
mq-deadline [bfq] none
$ cat /sys/block/dm-3/queue/scheduler
none

I have reproduced the gnome-shell freeze using the CFQ I/O scheduler. CFQ goes away in the next kernel release anyway, so for the moment I have been configuring my system to use BFQ.

I have also tried the mq-deadline scheduler. With all of these I/O schedulers, I saw fsync() latencies longer than ten seconds. My kernel is built with CONFIG_BLK_WBT_MQ=y. (WBT applies to the deadline scheduler; it does not apply to bfq by default).
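For reference, WBT exposes its per-device latency target in sysfs. The snippet below only reads the values; writing 0 to one of these files disables WBT for that device, and writing a microsecond value re-enables it (the stock default on rotational disks is, I believe, 75000, i.e. 75 ms):

```shell
# List the writeback-throttling latency target of each block device.
# A value of 0 means WBT is disabled for that device.
for q in /sys/block/*/queue/wbt_lat_usec; do
    [ -e "$q" ] && printf '%s: %s usec\n' "$q" "$(cat "$q")"
done
echo "wbt scan complete"
```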

# mkfs.ext4 /dev/alan_dell_2016/test
mke2fs 1.44.3 (10-July-2018)
Creating filesystem with 2982912 4k blocks and 746304 inodes
Filesystem UUID: 736bee3c-f0eb-49ee-b5be-de56ef1f38d4
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

# mount /dev/alan_dell_2016/test /mnt
# cd /mnt
# df -h .
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/alan_dell_2016-test   12G   41M   11G   1% /mnt

Test run

# dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync & sleep 1; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 1; killall -0 dd || break; done
[1] 17060

real    1m14.972s
user    0m0.001s
sys 0m0.000s

real    1m14.978s
user    0m0.005s
sys 0m0.002s
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 75.9998 s, 70.6 MB/s
[1]+  Done                    dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync
dd: no process found

# cp writetest copytest & sleep 3; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 3; killall -0 cp || break; done
[1] 17397

real    0m59.479s
user    0m0.000s
sys 0m0.002s
[1]+  Done                    cp -i writetest copytest

real    0m59.504s
user    0m0.037s
sys 0m4.385s
cp: no process found

I suppose this involves filesystem details. If I do the same sort of thing at the block device level, the latency is much lower.

# cd / && umount /mnt
# dd if=/dev/zero of=/dev/alan_dell_2016/test bs=1M count=2000 conv=fsync &
[1] 6681
# dd if=/dev/zero of=/dev/alan_dell_2016/test oflag=sync bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.193815 s, 21.1 kB/s

Best Answer

These delays depend on the filesystem. I ran the test on different filesystems. ext3 and ext4 both showed some "bad" results. XFS, btrfs, and bcachefs were "not as bad". I did not gather rigorous statistics, but there are some illustrative numbers in the next section.

UPDATE: The results also depend on the kernel version. A few patches have just been accepted for ext4, which appear very relevant to this problem. Yay!

I tried applying the patches on top of v5.2-rc5-293-g001da3fd278f. Running the tests, I did not see any delays of ten seconds or above, although some came close. It looked like it might sometimes still be worse than my XFS/btrfs/bcachefs results.

DISCLAIMER: My testing has not been very scientific :-). In this update, I made sure I could still reproduce "fsync() latencies of over ten seconds" as a baseline. However, my baseline tests were on kernel 5.1.6-200.fc29, not v5.2-rc5. And I could only reproduce the long delays in the first test (writetest), not the second test (copytest).

Later, I tried testing Firefox at the same time as the copytest. Firefox was version 67.0.4 (with a couple of extensions). (The kernels were the same v5.2-rc5-293-g001da3fd278f with patches on top, vs. a baseline of 5.1.11-200.fc29).

My Firefox test was to open a new tab to google.com. On the baseline there were delays of tens of seconds, which looked like they might be correlated with similarly long fsync delays. And I think the patched kernel has fixed these ten-second-plus delays in Firefox.

So I have great hopes for these patches!

That said, after the copytest reached around the 10G mark, overall GUI responsiveness started to degrade, including the mouse cursor. I forced a VT switch using Alt+SysRq+R. After a few VT switches back and forth, the GUI crashed.

Log messages from the crash are attached to the end of this answer. Possibly the GUI didn't like being forced like this. Or, it might be that swapping-induced delays trigger some race condition in the Wayland compositor and/or XWayland, perhaps similar to the known bug which gnome-shell "tries to avoid" :-(.

Ted Ts'o has also been murmuring about more long-term changes to ext4 journalling. So perhaps, some time in the future, there will be yet more improvement in ext4!


[PATCH v2 2/3] jbd2: introduce jbd2_inode dirty range scoping

Currently both journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() operate on the entire address space of each of the inodes associated with a given journal entry. The consequence of this is that if we have an inode where we are constantly appending dirty pages we can end up waiting for an indefinite amount of time in journal_finish_inode_data_buffers() while we wait for all the pages under writeback to be written out.

The easiest way to cause this type of workload is do just dd from /dev/zero to a file until it fills the entire filesystem. This can cause journal_finish_inode_data_buffers() to wait for the duration of the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges associated with a given transaction. We do this via the jbd2_inode structure so that the scoping is contained within jbd2 and so that it follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() respectively to the dirty range for a given struct jbd2_inode, keeping us from waiting forever if the inode in question is still being appended to.

(Old) results on different filesystems

These tests were performed using a kernel built from the bcachefs tree, version v4.20-297-g2252e4b79f8f (2019-02-14).

See also the disclaimers in the update above. I was not able to reproduce all of the ext4 results on v5.1.6. Either my test methods are not reliable, or there was a significant change when going up to v5.1.6.

ext4 bfq: writetest saw 15s, 30s.  copytest 10s, 40s.
ext4 mq-deadline: writetest saw 10s, 30s.  copytest 5s, 45s.

ext3 bfq: writetest saw 20s, 40s.  copytest ~0.2s, once saw 0.5s and 2s.
ext3 mq-deadline: writetest saw 50s.  copytest ~0.2s, very occasionally 1.5s.
ext3 mq-deadline, wbt disabled: writetest saw 10s, 40s.  copytest similar to the above.

ext2 bfq: writetest 0.1 - 0.9s.  copytest ~0.5s.
ext2 mq-deadline: writetest 0.2 - 0.6s.  copytest ~0.4s

xfs bfq: writetest 0.5 - 2s.  copytest 0.5 - 3.5s.
xfs mq-deadline: writetest 0.2s, some 0.5s.  copytest 0 - 3s.

bcachefs bfq: writetest 1.5 - 3s.
bcachefs mq-deadline: writetest 1 - 5s.

btrfs bfq: writetest 0.5-2s, copytest 1 - 2s.
btrfs mq-deadline: writetest ~0.4s, copytest 1 - 4s.

The ext3 figures look better when copying files, but ext3 is not a good idea for latency in general (e.g. see the tytso link :-). ext2 lacks journalling; journalling is generally desirable for robustness, but it is the ext3/ext4 style of journalling that causes this latency.

So the alternatives I would be most interested in are XFS, the experimental bcachefs, and btrfs. I expect XFS is the simplest to use, at least on mechanical hard drives. One prominent difference is that there is no tool to shrink an XFS filesystem, only to grow it.

GUI crash

Log messages corresponding to the "GUI crash" mentioned above.

(Soon after this crash the gdm greeter managed to crash as well. Although those messages were a little different.)

Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to set CRTC mode 1366x768: Permission denied
Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to disable CRTC
Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to disable CRTC
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (31), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2259)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 1896)
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE)
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: Fatal server error:
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE) wl_display@1: error 1: invalid arguments for zwp_relative_pointer_manager_v1@18.get_relative_pointer
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2257)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: unknown object (237), message get_relative_pointer(no)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 1814)
Jun 26 22:27:28 alan-laptop audit[1842]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=1842 comm="Xwayland" exe="/usr/bin/Xwayland" sig=6 res=1
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2251)
Jun 26 22:27:27 alan-laptop unknown[2257]: Error 22 (Invalid argument) dispatching to Wayland display.
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_input_device_get_device_type: assertion 'CLUTTER_IS_INPUT_DEVICE (device)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: JS ERROR: TypeError: device is null
                                               _init/<@resource:///org/gnome/shell/ui/keyboard.js:615:1
Jun 26 22:27:27 alan-laptop unknown[2251]: Error 22 (Invalid argument) dispatching to Wayland display.
Jun 26 22:27:31 alan-laptop systemd[1]: Created slice system-systemd\x2dcoredump.slice.