In this test, why does rename() take longer when fsync() is called first?
Environment: btrfs, mechanical HDD, Debian 9 container, running on kernel 5.0.17-200.fc29.x86_64.
Test command:

```
dpkg -r linux-image-4.9.0-9-amd64 >/dev/null 2>&1 && sync && time perf_4.9 trace --no-inherit -s dpkg $FORCE_UNSAFE_IO -i linux-image-4.9.0-9-amd64_4.9.168-1_amd64.deb && time sync
```
Compare the results of FORCE_UNSAFE_IO="" vs. FORCE_UNSAFE_IO="--force-unsafe-io".
With FORCE_UNSAFE_IO="" (the default, safe behaviour):

```
 dpkg (31632), 374488 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- --------- ------
   fsync               3442 14849.586     0.002     4.314   149.959      4.11%
   rename              8463 14573.509     0.003     1.722   358.675      4.80%
   wait4                  7  8043.762     0.004  1149.109  8028.468     99.78%
   read               44025  2151.135     0.000     0.049     3.732      0.57%
   open               19301   213.628     0.002     0.011     0.375      0.90%
   write               7846    165.460     0.003     0.021     0.149      0.42%
   sync_file_range     6834     96.513     0.001     0.014     0.822      2.20%
   ...

real    0m41.703s
user    0m9.709s
sys     0m6.586s

real    0m0.162s
user    0m0.000s
sys     0m0.003s
```
With FORCE_UNSAFE_IO="--force-unsafe-io":

```
 dpkg (1919), 334232 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- --------- ------
   wait4                  7  8290.981     0.007  1184.426  8279.676     99.84%
   read               44399  2168.096     0.000     0.049     2.146      0.50%
   fsync                 25   653.530     0.006    26.141    68.754      8.65%
   rename              8463   522.282     0.003     0.062    69.620     22.53%
   open               12467   163.671     0.002     0.013     0.217      0.97%
   write               7846   160.979     0.003     0.021     0.356      0.50%
   sync_file_range     3417    89.676     0.010     0.026     0.841      2.05%
   ...

real    0m13.498s
user    0m9.643s
sys     0m5.517s

real    0m0.146s
user    0m0.000s
sys     0m0.004s
```
The current strategy of dpkg (e.g. in Debian 9) is more sophisticated than you might be aware of. I am not sure it really affects this case, though. If you want more details, there is some background in this question: Could AIO fsync improve dpkg performance?
I do not know if this is relevant, but it occurs to me that on some filesystems, fsync() may effectively sync the containing directory as well, to make sure newly created files are visible on-disk before fsync() returns. I have read that this does not happen on ext2, but it does happen on ext4. As partial evidence, see ext4: make fsync to sync parent dir in no-journal for real this time
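To make the directory-durability point concrete, here is a minimal sketch (using hypothetical temporary paths, not dpkg's actual code) of the portable "write, fsync the file, rename, fsync the directory" pattern. It does not rely on any filesystem-specific side effect of fsync() syncing the parent directory for you:

```python
import os
import tempfile

def durable_replace(tmp_path, final_path):
    """Atomically install tmp_path at final_path, durably."""
    # 1. Make sure the file's contents are on disk.
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # 2. Atomically put the file in place under its final name.
    os.rename(tmp_path, final_path)
    # 3. fsync the containing directory, so the new directory
    #    entry itself is persisted -- the step dpkg skips.
    dirfd = os.open(os.path.dirname(final_path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

# Demonstration with a scratch directory.
d = tempfile.mkdtemp()
with open(os.path.join(d, "f.tmp"), "w") as f:
    f.write("data")
durable_replace(os.path.join(d, "f.tmp"), os.path.join(d, "f"))
print(sorted(os.listdir(d)))  # → ['f']
```

On a filesystem where rename() or fsync() already persists the directory entry, step 3 is redundant but harmless; on one where it does not, omitting it leaves a window where the file could vanish after a crash.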
In case you are surprised by the trailing sync timings: I can confirm that patching dpkg to replace the individual fsync() calls with a single global sync() call keeps the overall time down to about 13s, and I have not found anything inadequate about that approach on my system. dpkg just stopped using it because of other potential side effects.[1][2]
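A hedged sketch of the two flushing strategies being contrasted above (scratch files and counts are illustrative, not dpkg's real unpack logic): per-file fsync() blocks once per file and can force a journal/log commit each time, while deferring everything to one sync() pays the flush cost once.

```python
import os
import tempfile

d = tempfile.mkdtemp()

# Strategy 1: per-file fsync (dpkg's default behaviour).
# Each fsync() blocks until that file's data is on disk.
for i in range(5):
    path = os.path.join(d, "safe%d" % i)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)          # one synchronous flush per file
    finally:
        os.close(fd)

# Strategy 2: write everything first, then one global flush
# (the patched approach described in the comment above).
for i in range(5):
    path = os.path.join(d, "unsafe%d" % i)
    with open(path, "wb") as f:
        f.write(b"x" * 4096)
os.sync()                     # flush all dirty data once

print(len(os.listdir(d)))     # → 10
```

The trade-off: sync() flushes dirty data belonging to every process on the system, not just your own files, which is one reason a packaging tool might avoid it.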
Best Answer
Based on the commit description, I expect the rename() delays are caused by Btrfs: sync log after logging new name. This was added in kernel v4.19.
I do not believe the second sentence of that commit description is correct!
To be fair, I should point out that dpkg is forgetting to fsync() the directories containing the files before it records the package as correctly installed. But this btrfs behaviour is not exactly a perfect match with the rest of Linux. I do not believe XFS syncs the new directory entry inside rename() (i.e. deliberately waits for it to be persisted). My presumption against any sync writes inside XFS rename() is partly based on this thread: https://marc.info/?l=linux-xfs&m=139863577410237&w=2
For ext4, I mentioned evidence that fsync() might sync the new directory entry before it returns. But I do not believe that ext4's rename() does so.
I linked to recent discussions about AIO fsync() operations, and how they might allow efficient batching of metadata updates. There has not been much discussion about a hypothetical AIO rename(), because the usual assumption is that rename() is not a synchronous operation!
(btrfs feels a bit suspicious to me in general. For example, this data integrity bugfix landed only in the past few releases, and it was not the only scary-sounding fix in the changelogs for those releases.)
I think the rename() delays must be triggered by the BTRFS_NEED_LOG_SYNC returned from the last line of btrfs_log_new_name(). The way I found this was by using offcputime, which aggregates wait time by stack trace. The stack traces look like this: