Linux Performance – Why rename() Takes Longer with fsync()

btrfslinux-kernelperformance

In this test, why does rename() take longer when fsync() is called first?

Environment: btrfs, mechanical HDD, Debian 9 container, running on kernel 5.0.17-200.fc29.x86_64.

Test command: dpkg -r linux-image-4.9.0-9-amd64 >/dev/null 2>&1 && sync && time perf_4.9 trace --no-inherit -s dpkg $FORCE_UNSAFE_IO -i linux-image-4.9.0-9-amd64_4.9.168-1_amd64.deb && time sync

Compare the results of FORCE_UNSAFE_IO="" v.s. FORCE_UNSAFE_IO="--force-unsafe-io".

 dpkg (31632), 374488 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   fsync               3442 14849.586     0.002     4.314   149.959      4.11%
   rename              8463 14573.509     0.003     1.722   358.675      4.80%
   wait4                  7  8043.762     0.004  1149.109  8028.468     99.78%
   read               44025  2151.135     0.000     0.049     3.732      0.57%
   open               19301   213.628     0.002     0.011     0.375      0.90%
   write               7846   165.460     0.003     0.021     0.149      0.42%
   sync_file_range     6834    96.513     0.001     0.014     0.822      2.20%
...
real    0m41.703s
user    0m9.709s
sys 0m6.586s

real    0m0.162s
user    0m0.000s
sys 0m0.003s
 dpkg (1919), 334232 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   wait4                  7  8290.981     0.007  1184.426  8279.676     99.84%
   read               44399  2168.096     0.000     0.049     2.146      0.50%
   fsync                 25   653.530     0.006    26.141    68.754      8.65%
   rename              8463   522.282     0.003     0.062    69.620     22.53%
   open               12467   163.671     0.002     0.013     0.217      0.97%
   write               7846   160.979     0.003     0.021     0.356      0.50%
   sync_file_range     3417    89.676     0.010     0.026     0.841      2.05%
...
real    0m13.498s
user    0m9.643s
sys 0m5.517s

real    0m0.146s
user    0m0.000s
sys 0m0.004s

The current strategy of dpkg (e.g. in Debian 9) is more sophisticated than you might be aware. I am not sure it would really affect this case though. If you want more details, there is some background in this question: Could AIO fsync improve dpkg performance?

I do not know if this is relevant, but it occurs to me that on some filesystems fsync() may effectively sync the directory as well. This is to make sure newly created files are visible on-disk before fsync() returns. I read somewhere this does not happen on ext2, but it does happen on ext4. As partial evidence, see ext4: make fsync to sync parent dir in no-journal for real this time

In case you are surprised by the trailing sync timings, I can confirm that patching dpkg to replace the individual fsync() calls with a global sync() call appears to keep the overall time down to about 13s. And I have not found anything inadequate in that on my system. dpkg just stopped using that approach because of other potential side effects.[1][2]

Best Answer

Based on the commit description, I expect the rename() delays are caused by Btrfs: sync log after logging new name. This was added in kernel v4.19.

Make the logging of the new file name (which happens when creating a hard link or renaming) persist the log.

This approach not only is simpler, [...] but also gives us the same behaviour as ext4, xfs and f2fs (possibly other filesystems too).

I do not believe the second sentence is correct!

To be fair I should point out dpkg is forgetting to fsync() the directories containing the files, before it records the package as correctly installed. But this btrfs behaviour is not exactly a perfect match with the rest of Linux.

I do not believe XFS syncs the new directory entry inside rename() (i.e. deliberately waits for it to be persisted). My presumption against any sync writes inside XFS rename() is partly based on this thread: https://marc.info/?l=linux-xfs&m=139863577410237&w=2

For ext4, I mentioned evidence that fsync() might sync the new directory entry before it returns. But I do not believe that ext4's rename() does so.

I linked to recent discussions about AIO fsync() operations, and how they might allow efficient batching of metadata updates. There has not been much discussion about a hypothetical AIO rename(), because the usual assumption is that rename() is not a synchronous operation!

(btrfs feels a bit suspicious to me in general. I.e. I see this data integrity bugfix was in the past few releases, and it was not the only scary-sounding fix in the changelog for these releases).


I think the rename() delays must be triggered by the BTRFS_NEED_LOG_SYNC returned from the last line of btrfs_log_new_name().

The way I found this was by using offcputime. It aggregates wait time by stack trace. The stack traces look like this:

io_schedule_timeout
wait_for_completion_io
write_all_supers
btrfs_sync_log
btrfs_sync_file
do_fsync
__x64_sys_fsync
do_syscall_64
entry_SYSCALL_64_after_hwframe
-                dpkg (23528)
    9735954

io_schedule_timeout
wait_for_completion_io
write_all_supers
btrfs_sync_log
btrfs_rename2
vfs_rename
do_renameat2
__x64_sys_rename
do_syscall_64
entry_SYSCALL_64_after_hwframe
-                dpkg (23528)
    9147785

io_schedule
bit_wait_io
__wait_on_bit
out_of_line_wait_on_bit
write_all_supers
btrfs_sync_log
btrfs_sync_file
do_fsync
__x64_sys_fsync
do_syscall_64
entry_SYSCALL_64_after_hwframe
-                dpkg (23528)
    4478158

io_schedule
bit_wait_io
__wait_on_bit
out_of_line_wait_on_bit
write_all_supers
btrfs_sync_log
btrfs_rename2
vfs_rename
do_renameat2
__x64_sys_rename
do_syscall_64
entry_SYSCALL_64_after_hwframe
-                dpkg (23528)
    4376109
Related Question