In this test, why does rename() take longer when fsync() is called first?
Environment: btrfs, mechanical HDD, Debian 9 container, running on kernel 5.0.17-200.fc29.x86_64.
Test command:

```
dpkg -r linux-image-4.9.0-9-amd64 >/dev/null 2>&1 && sync && time perf_4.9 trace --no-inherit -s dpkg $FORCE_UNSAFE_IO -i linux-image-4.9.0-9-amd64_4.9.168-1_amd64.deb && time sync
```
Compare the results of FORCE_UNSAFE_IO="" vs. FORCE_UNSAFE_IO="--force-unsafe-io".
With FORCE_UNSAFE_IO="" (the default, safe behaviour):

```
 dpkg (31632), 374488 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- --------- ------
   fsync               3442 14849.586     0.002     4.314   149.959      4.11%
   rename              8463 14573.509     0.003     1.722   358.675      4.80%
   wait4                  7  8043.762     0.004  1149.109  8028.468     99.78%
   read               44025  2151.135     0.000     0.049     3.732      0.57%
   open               19301   213.628     0.002     0.011     0.375      0.90%
   write               7846    165.460     0.003     0.021     0.149      0.42%
   sync_file_range     6834     96.513     0.001     0.014     0.822      2.20%
   ...

real    0m41.703s
user    0m9.709s
sys     0m6.586s

real    0m0.162s
user    0m0.000s
sys     0m0.003s
```
With FORCE_UNSAFE_IO="--force-unsafe-io":

```
 dpkg (1919), 334232 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- --------- ------
   wait4                  7  8290.981     0.007  1184.426  8279.676     99.84%
   read               44399  2168.096     0.000     0.049     2.146      0.50%
   fsync                 25   653.530     0.006    26.141    68.754      8.65%
   rename              8463   522.282     0.003     0.062    69.620     22.53%
   open               12467   163.671     0.002     0.013     0.217      0.97%
   write               7846   160.979     0.003     0.021     0.356      0.50%
   sync_file_range     3417    89.676     0.010     0.026     0.841      2.05%
   ...

real    0m13.498s
user    0m9.643s
sys     0m5.517s

real    0m0.146s
user    0m0.000s
sys     0m0.004s
```
The current strategy of dpkg (e.g. in Debian 9) is more sophisticated than you might be aware of. I am not sure it really affects this case, though. If you want more details, there is some background in this question: Could AIO fsync improve dpkg performance?
I do not know if this is relevant, but it occurs to me that on some filesystems, fsync() may effectively sync the containing directory as well, to make sure newly created files are visible on-disk before fsync() returns. I have read that this does not happen on ext2, but it does happen on ext4. As partial evidence, see ext4: make fsync to sync parent dir in no-journal for real this time
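To make the directory-durability point concrete, here is a minimal sketch (using hypothetical temporary paths, not dpkg's actual code) of the portable "write, fsync the file, rename, fsync the directory" pattern. It does not rely on any filesystem-specific side effect of fsync() syncing the parent directory for you:

```python
import os
import tempfile

def durable_replace(tmp_path, final_path):
    """Atomically install tmp_path at final_path, durably."""
    # 1. Make sure the file's contents are on disk.
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # 2. Atomically put the file in place under its final name.
    os.rename(tmp_path, final_path)
    # 3. fsync the containing directory, so the new directory
    #    entry itself is persisted -- the step dpkg skips.
    dirfd = os.open(os.path.dirname(final_path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

# Demonstration with a scratch directory.
d = tempfile.mkdtemp()
with open(os.path.join(d, "f.tmp"), "w") as f:
    f.write("data")
durable_replace(os.path.join(d, "f.tmp"), os.path.join(d, "f"))
print(sorted(os.listdir(d)))  # → ['f']
```

On a filesystem where rename() or fsync() already persists the directory entry, step 3 is redundant but harmless; on one where it does not, omitting it leaves a window where the file could vanish after a crash.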
In case you are surprised by the trailing sync timings: I can confirm that patching dpkg to replace the individual fsync() calls with a single global sync() call keeps the overall time down to about 13s, and I have not found anything inadequate about that approach on my system. dpkg just stopped using it because of other potential side effects.[1][2]
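A hedged sketch of the two flushing strategies being contrasted above (scratch files and counts are illustrative, not dpkg's real unpack logic): per-file fsync() blocks once per file and can force a journal/log commit each time, while deferring everything to one sync() pays the flush cost once.

```python
import os
import tempfile

d = tempfile.mkdtemp()

# Strategy 1: per-file fsync (dpkg's default behaviour).
# Each fsync() blocks until that file's data is on disk.
for i in range(5):
    path = os.path.join(d, "safe%d" % i)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)          # one synchronous flush per file
    finally:
        os.close(fd)

# Strategy 2: write everything first, then one global flush
# (the patched approach described in the comment above).
for i in range(5):
    path = os.path.join(d, "unsafe%d" % i)
    with open(path, "wb") as f:
        f.write(b"x" * 4096)
os.sync()                     # flush all dirty data once

print(len(os.listdir(d)))     # → 10
```

The trade-off: sync() flushes dirty data belonging to every process on the system, not just your own files, which is one reason a packaging tool might avoid it.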
Best Answer
Based on the commit description, I expect the rename() delays are caused by Btrfs: sync log after logging new name. This was added in kernel v4.19.
I do not believe the second sentence of that commit description is correct!
To be fair, I should point out that dpkg is forgetting to fsync() the directories containing the files before it records the package as correctly installed. But this btrfs behaviour is not exactly a perfect match with the rest of Linux. I do not believe XFS syncs the new directory entry inside rename() (i.e. deliberately waits for it to be persisted). My presumption against any sync writes inside XFS rename() is partly based on this thread: https://marc.info/?l=linux-xfs&m=139863577410237&w=2
For ext4, I mentioned evidence that fsync() might sync the new directory entry before it returns. But I do not believe that ext4's rename() does so.
I linked to recent discussions about AIO fsync() operations, and how they might allow efficient batching of metadata updates. There has not been much discussion about a hypothetical AIO rename(), because the usual assumption is that rename() is not a synchronous operation!
(btrfs feels a bit suspicious to me in general. For example, this data integrity bugfix landed only in the past few releases, and it was not the only scary-sounding fix in the changelogs for those releases.)
I think the rename() delays must be triggered by the BTRFS_NEED_LOG_SYNC returned from the last line of btrfs_log_new_name(). The way I found this was by using offcputime, which aggregates wait time by stack trace. The stack traces look like this: