Linux – Why is it slower to write the same data to a *larger* pre-allocated file

Tags: ext4, linux, performance

I am writing four 4KB blocks to a file. The writes are consistently around 50% slower if I have used fallocate() to pre-allocate 9 blocks for the file, instead of pre-allocating only the 4 blocks I write. Why?

There seems to be a cut-off point between pre-allocating 8 and 9 blocks. I'm also wondering why the 1st and 2nd block writes are consistently slower.

This test is boiled down from some file copy code I'm playing with. Inspired by this question about dd, I am using O_DSYNC writes so that I can measure the real progress of the disk writes. (The full idea was to start by copying a small block to measure the minimum latency, then adaptively increase the block size to improve throughput.)

I am testing on Fedora 28, on a laptop with a spinning hard disk drive. It was upgraded from an earlier Fedora, so the filesystem is not brand-new. I don't think I've been fiddling with the filesystem defaults.

  • Kernel: 4.17.19-200.fc28.x86_64
  • Filesystem: ext4, on LVM.
  • Mount options: rw,relatime,seclabel
  • Fields from tune2fs -l
    • Default mount options: user_xattr acl
    • Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
    • Filesystem flags: signed_directory_hash
    • Block size: 4096
    • Free blocks: 7866091

Timings from strace -s3 -T test-program.py:

openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000048>
write(3, "\0\0\0"..., 4096)             = 4096 <0.036378>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033380>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033359>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033399>
close(3)                                = 0 <0.000033>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000110>
fallocate(3, 0, 0, 16384)               = 0 <0.016467>
fsync(3)                                = 0 <0.000201>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033062>
write(3, "\0\0\0"..., 4096)             = 4096 <0.013806>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008324>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008346>
close(3)                                = 0 <0.000025>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000070>
fallocate(3, 0, 0, 32768)               = 0 <0.019096>
fsync(3)                                = 0 <0.000311>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032882>
write(3, "\0\0\0"..., 4096)             = 4096 <0.010824>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008188>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008266>
close(3)                                = 0 <0.000012>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000050>
fallocate(3, 0, 0, 36864)               = 0 <0.022417>
fsync(3)                                = 0 <0.000260>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032953>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033265>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033317>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033237>
close(3)                                = 0 <0.000019>
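
To compare the runs at a glance, the per-call durations that strace -T prints in angle brackets can be summed for each openat()...close() group. This is only a minimal sketch: the helper name write_time_per_run is made up for illustration, and the sample is trimmed to the 8-block and 9-block runs from the trace above.

```python
import re

# Matches the duration strace -T appends to each call, e.g. "<0.033380>".
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")

def write_time_per_run(trace_text):
    """Return the total seconds spent in write() for each openat() group."""
    totals = []
    for line in trace_text.splitlines():
        line = line.strip()
        if line.startswith("openat("):
            totals.append(0.0)  # every openat() starts a new run
        elif line.startswith("write(") and totals:
            match = DURATION.search(line)
            if match:
                totals[-1] += float(match.group(1))
    return totals

# Raw string, so the "\0" sequences copied from strace stay literal text.
sample = r"""
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000070>
fallocate(3, 0, 0, 32768)               = 0 <0.019096>
fsync(3)                                = 0 <0.000311>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032882>
write(3, "\0\0\0"..., 4096)             = 4096 <0.010824>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008188>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008266>
close(3)                                = 0 <0.000012>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000050>
fallocate(3, 0, 0, 36864)               = 0 <0.022417>
fsync(3)                                = 0 <0.000260>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032953>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033265>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033317>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033237>
close(3)                                = 0 <0.000019>
"""

for total in write_time_per_run(sample):
    print(f"{total:.6f} s spent in write()")
```

For these two runs the four writes add up to roughly 0.060 s with 8 blocks pre-allocated and roughly 0.133 s with 9 blocks.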

test-program.py:

#! /usr/bin/python3
import os

# Required third party module,
# install with "pip3 install --user fallocate".
from fallocate import fallocate

block = b'\0' * 4096

for alloc in [0, 4, 8, 9]:
    # Open file for writing, with implicit fdatasync().
    fd = os.open("out.tmp", os.O_WRONLY | os.O_DSYNC |
                            os.O_CREAT | os.O_TRUNC)

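    # Try to pre-allocate space, then flush the allocation to disk
    # (this is the fsync() that shows up in the strace output above).
    if alloc:
        fallocate(fd, 0, alloc * 4096)
        os.fsync(fd)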

    os.write(fd, block)
    os.write(fd, block)
    os.write(fd, block)
    os.write(fd, block)

    os.close(fd)

Best Answer

The reason for the difference between 8 and 9 4KB blocks is that ext4 uses a heuristic when converting an unwritten extent (space reserved by fallocate() but not yet initialized) into a normal, written extent. For unwritten extents of 32KB or less, it just fills the whole extent with zeroes and rewrites the whole thing, while larger extents are split into two or three smaller extents and written out.

In the 8-block case, the whole 32KB extent is converted to a normal extent, the first 16KB is written with your data and the remainder is zero-filled and written out. In the 9-block case, the 36KB extent is split (because it is over 32KB), and you are left with a 16KB extent for your data and a 20KB unwritten extent.
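
The two cases can be sketched as a toy model. This is not the kernel's code, only an illustration of the rule as described here: the function name and return format are invented, the 32KB limit is hard-coded (on real systems it is, as far as I know, tunable through ext4's extent_max_zeroout_kb sysfs attribute), and it only covers writes at the start of the extent, so the three-way split is left out.

```python
BLOCK_KB = 4          # filesystem block size from the question
ZEROOUT_MAX_KB = 32   # threshold described in this answer (assumed fixed here)

def convert_on_write(prealloc_blocks, written_blocks):
    """Toy model of the extent layout after writing the first written_blocks.

    Returns (strategy, extents), where extents is a list of
    (kind, size_in_kb) tuples in file order.
    """
    extent_kb = prealloc_blocks * BLOCK_KB
    data_kb = written_blocks * BLOCK_KB
    if extent_kb <= ZEROOUT_MAX_KB:
        # Small extent: zero-fill the tail and mark the whole thing written.
        return "zero-fill", [("written", extent_kb)]
    # Large extent: split off the written part, leave the rest unwritten.
    return "split", [("written", data_kb), ("unwritten", extent_kb - data_kb)]

for blocks in (4, 8, 9):
    print(blocks, convert_on_write(blocks, 4))
```

With 8 blocks the model reports a single 32KB written extent; with 9 blocks it reports a 16KB written extent followed by a 20KB unwritten one, matching the description above.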

Strictly speaking, the 20KB unwritten extent should also just be zero filled and written out, but I suspect it doesn't do that. However, that would just change the break-even point a bit (to 16KB+32KB = 12 blocks in your case), but wouldn't change the underlying behavior.

You could use filefrag -v out.tmp after the first write to see the block allocation layout on disk.
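
If you want to do that from inside the test program, a small wrapper along these lines could be called right after the first os.write(). It is only a sketch: show_extents is a made-up helper name, and it assumes filefrag (from e2fsprogs) is on your PATH, which may not be the case for a non-root user.

```python
import shutil
import subprocess

def show_extents(path):
    """Return the output of `filefrag -v path`, or None if filefrag is missing."""
    if shutil.which("filefrag") is None:
        return None
    result = subprocess.run(["filefrag", "-v", path],
                            capture_output=True, text=True, check=False)
    # Keep stderr too, so errors such as unsupported filesystems stay visible.
    return result.stdout + result.stderr
```

For example, print(show_extents("out.tmp")) after the first write in each iteration would let you compare the layouts for the 8-block and 9-block cases.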

That said, you could just avoid fallocate() and O_DSYNC completely and let the filesystem do its job to write out the data as quickly as possible, instead of making the file layout worse than it needs to be.
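
A minimal sketch of that simpler approach, writing through the page cache with no pre-allocation. The single fsync() at the end is my assumption, for the case where you still need the data to be on disk before continuing; write_buffered is just an illustrative name.

```python
import os

BLOCK = b"\0" * 4096

def write_buffered(path, blocks=4):
    """Write blocks with plain buffered writes and flush once at the end."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        for _ in range(blocks):
            os.write(fd, BLOCK)
        # Assumption: durability is still wanted, so flush once here
        # instead of paying for a synchronous flush on every write.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The trade-off is that you no longer see per-block progress of the disk writes, only the total time up to the final flush.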
