Linux – Simple file copy (or write) causes ten second+ latency on Linux filesystem

Tags: filesystems, latency, linux

I am running Linux on a spinning hard drive. WDC WD5000LPLX-7, "WD Black Mobile", 7200 RPM.

I noticed that a simple file copy (or write) causes fsync() latencies of over ten seconds. Is there some way to avoid this on Linux without replacing the hardware or changing the cp command[*], or is there no other way around it?

[*] I am able to avoid it if I write to the files using O_DIRECT instead.
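As a sketch of that workaround from the shell, dd can request O_DIRECT with oflag=direct, which bypasses the page cache so no large writeback backlog builds up for a later fsync() to wait on. Note O_DIRECT needs block-aligned I/O sizes, and some filesystems (e.g. tmpfs) reject it, hence the fallback branch:

```shell
# Sketch: write a file with O_DIRECT, bypassing the page cache.
# O_DIRECT requires block-aligned I/O and is not supported by every
# filesystem (tmpfs rejects it), hence the fallback message.
if dd if=/dev/zero of=directtest bs=1M count=4 oflag=direct conv=fsync 2>/dev/null; then
    echo "direct write ok"
else
    echo "O_DIRECT unsupported on this filesystem"
fi
rm -f directtest
echo "direct write test complete"
```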

What is fsync()?

https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/

fsync() followed by rename() is used to update a file atomically, so the contents survive a power failure.
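The pattern itself is simple. A minimal shell sketch (myconfig is a hypothetical file name; coreutils `sync FILE` performs an fsync() on that file, just like the test loop below does):

```shell
# Write the new contents to a temporary file in the same directory,
# flush it to disk, then rename over the original.  After a power
# failure the file holds either the old or the new contents in full,
# never a torn mixture of the two.
printf 'new contents\n' > myconfig.tmp
sync myconfig.tmp            # fsync(): wait until the data is on disk
mv myconfig.tmp myconfig     # rename() is atomic within one filesystem
```

A fully careful program would additionally fsync() the containing directory afterwards, so the rename itself is also durable.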

Application developers are advised to write configuration/state updates from a separate thread, so they do not freeze the user interface if the write takes a while. (See example: freeze in gnome-shell). However, this advice is less useful when saving user files. For example, when you edit files one at a time using an editor in the terminal – vi my-file, edit, :wq to finish – vi naturally waits for fsync() to complete before exiting. You might prefer to use a different editor, but I bet yours does the same thing :-).

Test setup

$ sudo -i
# lvcreate alan_dell_2016 -n test --extents 100%FREE
  Logical volume "test" created.
# ls -l /dev/alan_dell_2016/test
lrwxrwxrwx. 1 root root 7 Feb 18 13:34 /dev/alan_dell_2016/test -> ../dm-3

$ uname -r
4.20.3-200.fc29.x86_64

$ cat /sys/block/sda/queue/scheduler
mq-deadline [bfq] none
$ cat /sys/block/dm-3/queue/scheduler
none

I have reproduced the gnome-shell freeze using the CFQ I/O scheduler. CFQ goes away in the next kernel release anyway, so for the moment I have been configuring my system to use BFQ.

I have also tried the mq-deadline scheduler. With all of these I/O schedulers, I saw fsync() latencies longer than ten seconds. My kernel is built with CONFIG_BLK_WBT_MQ=y. (WBT applies to the deadline scheduler; it does not apply to bfq by default).
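For reference, WBT exposes its per-device latency target in sysfs. The snippet below only reads the values; writing 0 to one of these files disables WBT for that device, and writing a microsecond value re-enables it (the stock default on rotational disks is, I believe, 75000, i.e. 75 ms):

```shell
# List the writeback-throttling latency target of each block device.
# A value of 0 means WBT is disabled for that device.
for q in /sys/block/*/queue/wbt_lat_usec; do
    [ -e "$q" ] && printf '%s: %s usec\n' "$q" "$(cat "$q")"
done
echo "wbt scan complete"
```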

# mkfs.ext4 /dev/alan_dell_2016/test
mke2fs 1.44.3 (10-July-2018)
Creating filesystem with 2982912 4k blocks and 746304 inodes
Filesystem UUID: 736bee3c-f0eb-49ee-b5be-de56ef1f38d4
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

# mount /dev/alan_dell_2016/test /mnt
# cd /mnt
# df -h .
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/alan_dell_2016-test   12G   41M   11G   1% /mnt

Test run

# dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync & sleep 1; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 1; killall -0 dd || break; done
[1] 17060

real    1m14.972s
user    0m0.001s
sys 0m0.000s

real    1m14.978s
user    0m0.005s
sys 0m0.002s
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 75.9998 s, 70.6 MB/s
[1]+  Done                    dd if=/dev/zero of=writetest bs=1M count=5k conv=fsync
dd: no process found

# cp writetest copytest & sleep 3; while true; do time sh -c 'echo 1 > latencytest; time sync latencytest; mv latencytest latencytest2'; sleep 3; killall -0 cp || break; done
[1] 17397

real    0m59.479s
user    0m0.000s
sys 0m0.002s
[1]+  Done                    cp -i writetest copytest

real    0m59.504s
user    0m0.037s
sys 0m4.385s
cp: no process found

I suppose this involves filesystem details. If I do the same sort of thing at the block device level, the latency is much lower.

# cd / && umount /mnt
# dd if=/dev/zero of=/dev/alan_dell_2016/test bs=1M count=2000 conv=fsync &
[1] 6681
# dd if=/dev/zero of=/dev/alan_dell_2016/test oflag=sync bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.193815 s, 21.1 kB/s

Best Answer

These delays depend on the filesystem. I ran the test on different filesystems. ext3 and ext4 both showed some "bad" results. XFS, btrfs, and bcachefs were "not as bad". I did not gather rigorous statistics, but there are some illustrative numbers in the next section.

UPDATE: The results also depend on the kernel version. A few patches have just been accepted for ext4, which appear very relevant to this problem. Yay!

I tried applying the patches on top of v5.2-rc5-293-g001da3fd278f. Running the tests, I did not see any delays of ten seconds or above, although some came close. It looked like it might sometimes still be worse than my XFS/btrfs/bcachefs results.

DISCLAIMER: My testing has not been very scientific :-). In this update, I made sure I could still reproduce "fsync() latencies of over ten seconds" as a baseline. However, my baseline tests were on kernel 5.1.6-200.fc29, not v5.2-rc5. And I could only reproduce the long delays in the first test (writetest), not the second test (copytest).

Later, I tried testing Firefox at the same time as the copytest. Firefox was version 67.0.4 (with a couple of extensions). (The kernels were the same v5.2-rc5-293-g001da3fd278f with patches on top, vs. a baseline of 5.1.11-200.fc29).

My Firefox test was to open a new tab to google.com. On the baseline there were delays of tens of seconds, which looked like they might be correlated with similarly long fsync delays. And I think the patched kernel has fixed these ten-second-plus delays in Firefox.

So I have great hopes for these patches!

That said, after the copytest reached around the 10G mark, overall GUI responsiveness started to degrade, including the mouse cursor. I forced a VT switch using Alt+SysRq+R. After a few VT switches back and forth, the GUI crashed.

Log messages from the crash are attached to the end of this answer. Possibly the GUI didn't like being forced like this. Or, it might be that swapping-induced delays trigger some race condition in the Wayland compositor and/or XWayland, perhaps similar to the known bug which gnome-shell "tries to avoid" :-(.

Ted Ts'o has also been murmuring about more long-term changes to ext4 journalling. So perhaps, some time in the future, there will be yet more improvement in ext4!


[PATCH v2 2/3] jbd2: introduce jbd2_inode dirty range scoping

Currently both journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() operate on the entire address space of each of the inodes associated with a given journal entry. The consequence of this is that if we have an inode where we are constantly appending dirty pages we can end up waiting for an indefinite amount of time in journal_finish_inode_data_buffers() while we wait for all the pages under writeback to be written out.

The easiest way to cause this type of workload is do just dd from /dev/zero to a file until it fills the entire filesystem. This can cause journal_finish_inode_data_buffers() to wait for the duration of the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges associated with a given transaction. We do this via the jbd2_inode structure so that the scoping is contained within jbd2 and so that it follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() respectively to the dirty range for a given struct jbd2_inode, keeping us from waiting forever if the inode in question is still being appended to.

(Old) results on different filesystems

These tests were performed using a kernel built from the bcachefs tree, version v4.20-297-g2252e4b79f8f (2019-02-14).

See also the disclaimers in the update above. I was not able to reproduce all of the ext4 results on v5.1.6. Either my test methods are not reliable, or there was a significant change when going up to v5.1.6.

ext4 bfq: writetest saw 15s, 30s.  copytest 10s, 40s.
ext4 mq-deadline: writetest saw 10s, 30s.  copytest 5s, 45s.

ext3 bfq: writetest saw 20s, 40s.  copytest ~0.2s, once saw 0.5s and 2s.
ext3 mq-deadline: writetest saw 50s.  copytest ~0.2s, very occasionally 1.5s.
ext3 mq-deadline, wbt disabled: writetest saw 10s, 40s.  copytest similar to the above.

ext2 bfq: writetest 0.1 - 0.9s.  copytest ~0.5s.
ext2 mq-deadline: writetest 0.2 - 0.6s.  copytest ~0.4s

xfs bfq: writetest 0.5 - 2s.  copytest 0.5 - 3.5s.
xfs mq-deadline: writetest 0.2s, some 0.5s.  copytest 0 - 3s.

bcachefs bfq: writetest 1.5 - 3s.
bcachefs mq-deadline: writetest 1 - 5s.

btrfs bfq: writetest 0.5-2s, copytest 1 - 2s.
btrfs mq-deadline: writetest ~0.4s, copytest 1 - 4s.

The ext3 figures look better when copying files, but ext3 is not a good idea for latency in general (e.g. see the tytso link :-). ext2 lacks journalling; journalling is generally desirable for robustness, but it is the ext3/ext4 style of journalling that causes this latency.

So the alternatives I would be most interested in are XFS, the experimental bcachefs, and btrfs. I expect XFS is the simplest to use, at least on mechanical hard drives. One prominent difference is that there is no tool to shrink an XFS filesystem, only to grow it.

GUI crash

Log messages corresponding to the "GUI crash" mentioned above.

(Soon after this crash the gdm greeter managed to crash as well. Although those messages were a little different.)

Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to set CRTC mode 1366x768: Permission denied
Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to disable CRTC
Jun 26 22:27:25 alan-laptop gnome-shell[1814]: Failed to disable CRTC
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (31), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2259)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 1896)
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE)
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: Fatal server error:
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE) wl_display@1: error 1: invalid arguments for zwp_relative_pointer_manager_v1@18.get_relative_pointer
Jun 26 22:27:27 alan-laptop org.gnome.Shell.desktop[1814]: (EE)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:26 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2257)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: unknown object (237), message get_relative_pointer(no)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 1814)
Jun 26 22:27:28 alan-laptop audit[1842]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=1842 comm="Xwayland" exe="/usr/bin/Xwayland" sig=6 res=1
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_actor_set_reactive: assertion 'CLUTTER_IS_ACTOR (actor)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: unknown object (28), message get_swipe_gesture(no)
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: WL: error in client communication (pid 2251)
Jun 26 22:27:27 alan-laptop unknown[2257]: Error 22 (Invalid argument) dispatching to Wayland display.
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: clutter_input_device_get_device_type: assertion 'CLUTTER_IS_INPUT_DEVICE (device)' failed
Jun 26 22:27:27 alan-laptop gnome-shell[1814]: JS ERROR: TypeError: device is null
                                               _init/<@resource:///org/gnome/shell/ui/keyboard.js:615:1
Jun 26 22:27:27 alan-laptop unknown[2251]: Error 22 (Invalid argument) dispatching to Wayland display.
Jun 26 22:27:31 alan-laptop systemd[1]: Created slice system-systemd\x2dcoredump.slice.