Why does closing a file wait for sync when overwriting a file, but not when creating one?

Tags: cache, ext4, files

When running this script:

#!/usr/bin/env python3
f = open("foo", "w")
f.write("1"*10000000000)
f.close()
print("closed")

I can observe the following process on my Ubuntu machine:

The memory fills with 10GB.
The Page Cache fills with 10GB of dirty pages. (/proc/meminfo)
"closed" is printed and the script terminates.
A while after, the dirty pages decrease.

However, if file "foo" already exists, close() blocks until all dirty pages have been written back.

What is the reason for this behavior?

This is the strace if the file does NOT exist:

openat(AT_FDCWD, "foo", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7ffd50dc76f0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
ioctl(3, TCGETS, 0x7ffd50dc76c0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd9892e000
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcb4486f000
write(3, "11111111111111111111111111111111"..., 10000000000) = 2147479552
write(3, "11111111111111111111111111111111"..., 7852520448) = 2147479552
write(3, "11111111111111111111111111111111"..., 5705040896) = 2147479552
write(3, "11111111111111111111111111111111"..., 3557561344) = 2147479552
write(3, "11111111111111111111111111111111"..., 1410081792) = 1410081792
munmap(0x7fcb4486f000, 10000003072)     = 0
munmap(0x7fcd9892e000, 10000003072)     = 0
close(3)                                = 0
write(1, "closed\n", 7closed
)                 = 7
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fcfedd5cf20}, {sa_handler=0x62ffc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fcfedd5cf20}, 8) = 0
sigaltstack(NULL, {ss_sp=0x2941be0, ss_flags=0, ss_size=8192}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

This is the strace if it exists:

openat(AT_FDCWD, "foo", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7fffa00b4fe0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
ioctl(3, TCGETS, 0x7fffa00b4fb0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71de68b000
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6f8a5cc000
write(3, "11111111111111111111111111111111"..., 10000000000) = 2147479552
write(3, "11111111111111111111111111111111"..., 7852520448) = 2147479552
write(3, "11111111111111111111111111111111"..., 5705040896) = 2147479552
write(3, "11111111111111111111111111111111"..., 3557561344) = 2147479552
write(3, "11111111111111111111111111111111"..., 1410081792) = 1410081792
munmap(0x7f6f8a5cc000, 10000003072)     = 0
munmap(0x7f71de68b000, 10000003072)     = 0
close(3#### strace will block exactly here until write-back is completed ####)                                = 0 
write(1, "closed\n", 7closed
)                 = 7
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7433ab9f20}, {sa_handler=0x62ffc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7433ab9f20}, 8) = 0
sigaltstack(NULL, {ss_sp=0x1c68be0, ss_flags=0, ss_size=8192}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

The same behaviour can be observed when simply printing and piping into a file instead of using Python file I/O, as well as when doing the same with a small equivalent C++ program printing to cout. It seems to be the actual system call that blocks.

Best Answer

That sounds like a reminder of the O_PONIES fiasco, which just recently had its 11th birthday.

Before ext4 came, ext3 had acquired a sort of a reputation for being stable in the face of power losses. It seldom broke, it seldom lost data from files. Then, ext4 added delayed allocation of data blocks, meaning that it didn't even try to write file data to disk immediately. Normally, that's not a problem as long as the data gets there at some point, and for temporary files, it might turn out that there was no need to write the data to disk at all.

But ext4 did write metadata changes, and recorded that something had changed with the file. Now, if the system crashed, the file was marked as truncated, but the writes after that weren't stored on disk (because no blocks were allocated for them). Hence, on ext4, you'd often see recently-modified files truncated to a zero length after a crash.

That, of course, was not exactly what most users wanted, but the argument was made that application programs that cared about their data so much should have called fsync(), and if they actually cared about renames, they should fsync() (or at least fdatasync()) the containing directory too. Next to no-one did that, though, partly because on ext3, an fsync() synced the whole disk, possibly including large amounts of unrelated data. (Or so close to the whole disk that the difference doesn't matter anyway.)
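For illustration, the "rigid dance" described above might look roughly like this in Python. This is only a sketch: the helper name write_durably is made up for this example, and it assumes a Linux system where a directory can be opened read-only and passed to fsync().

```python
import os

def write_durably(path: str, data: bytes) -> None:
    """Hypothetical helper: write data and ask the kernel to flush it to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to write this file's data to disk
    # Also sync the containing directory so the directory entry is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With the explicit fsync() the waiting happens at a point the program chooses, regardless of whether the file existed before.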

Now, on one hand, you had ext3 which performed poorly with fsync() and on the other, ext4 that required fsync() to not lose files. Not a nice situation, considering that most application programs would care to implement filesystem-specific behavior even less than the rigid dance with calling fsync() at just the right moments. Apparently it wasn't even easy to figure out if a filesystem was mounted as ext3 or ext4 in the first place.

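In the end, the ext4 developers changed the behavior for the most common critical-seeming cases, forcing the delayed blocks to be allocated and the data pushed out instead of leaving it for later: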

Renaming a file on top of another. On a running system, this is an atomic update and is commonly used to put a new version of a file in place.
Overwriting an existing file (your case). This isn't atomic on a running system, but usually means the application wants the file replaced, not truncated. If an overwrite is botched, you'd lose the old version of the file too, so this is a bit different from creating a completely new file where a power-out would only lose the most recent data.
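The first pattern, putting a new version in place by renaming over the old one, could be sketched like this. The function name replace_atomically is hypothetical, and the sketch assumes the temporary file lives in the same directory (and thus on the same filesystem) as the target, so that the rename can be atomic. Note that mkstemp() creates the file with mode 0600, so the permissions of the original file are not preserved here.

```python
import os
import tempfile

def replace_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: replace path with data via write-temp-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the new data is on disk first
        os.replace(tmp_path, path)   # rename over the old file
    except BaseException:
        # Something failed before the rename completed: remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Compared to opening the existing file with truncation, readers on a running system see either the complete old version or the complete new one, never a half-written file.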

As far as I can remember, XFS also exhibited similar zero-length files after a crash even before ext4. I never followed that, though, so I don't know what sorts of fixes they'd have done.

See, e.g. this article on LWN, which mentions the fixes: ext4 and data loss (March 2009)

(There were other writings about that at the time, of course, but I'm not sure it's useful to link to them, as it's mostly a question of pointing fingers.)
