When running this script:
#!/usr/bin/env python3
f = open("foo", "w")
f.write("1"*10000000000)
f.close()
print("closed")
I can observe the following process on my Ubuntu machine:
The memory fills with 10GB.
The Page Cache fills with 10GB of dirty pages. (/proc/meminfo)
"closed" is printed and the script terminates.
A while after, the dirty pages decrease.
However, if file "foo" already exists, close() blocks until all dirty pages have been written back.
What is the reason for this behavior?
This is the strace if the file does NOT exist:
openat(AT_FDCWD, "foo", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7ffd50dc76f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
ioctl(3, TCGETS, 0x7ffd50dc76c0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd9892e000
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcb4486f000
write(3, "11111111111111111111111111111111"..., 10000000000) = 2147479552
write(3, "11111111111111111111111111111111"..., 7852520448) = 2147479552
write(3, "11111111111111111111111111111111"..., 5705040896) = 2147479552
write(3, "11111111111111111111111111111111"..., 3557561344) = 2147479552
write(3, "11111111111111111111111111111111"..., 1410081792) = 1410081792
munmap(0x7fcb4486f000, 10000003072) = 0
munmap(0x7fcd9892e000, 10000003072) = 0
close(3) = 0
write(1, "closed\n", 7closed
) = 7
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fcfedd5cf20}, {sa_handler=0x62ffc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fcfedd5cf20}, 8) = 0
sigaltstack(NULL, {ss_sp=0x2941be0, ss_flags=0, ss_size=8192}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
exit_group(0) = ?
+++ exited with 0 +++
This is the strace if it exists:
openat(AT_FDCWD, "foo", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7fffa00b4fe0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
ioctl(3, TCGETS, 0x7fffa00b4fb0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71de68b000
mmap(NULL, 10000003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6f8a5cc000
write(3, "11111111111111111111111111111111"..., 10000000000) = 2147479552
write(3, "11111111111111111111111111111111"..., 7852520448) = 2147479552
write(3, "11111111111111111111111111111111"..., 5705040896) = 2147479552
write(3, "11111111111111111111111111111111"..., 3557561344) = 2147479552
write(3, "11111111111111111111111111111111"..., 1410081792) = 1410081792
munmap(0x7f6f8a5cc000, 10000003072) = 0
munmap(0x7f71de68b000, 10000003072) = 0
close(3#### strace will block exactly here until write-back is completed ####) = 0
write(1, "closed\n", 7closed
) = 7
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7433ab9f20}, {sa_handler=0x62ffc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7433ab9f20}, 8) = 0
sigaltstack(NULL, {ss_sp=0x1c68be0, ss_flags=0, ss_size=8192}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
exit_group(0) = ?
+++ exited with 0 +++
The same behaviour can be observed when simply printing and piping into a file instead of using Python file I/O, as well as when doing the same with a small equivalent C++ program printing to cout. It seems to be the actual close() system call that blocks.
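For anyone who wants to reproduce the observation without filling 10GB of memory, here is a minimal sketch that times only the close() call and reads the Dirty value from /proc/meminfo. The helper names dirty_kib and timed_rewrite are made up for this illustration, and the sketch assumes a Linux system.

```python
import time


def dirty_kib(meminfo_text):
    """Return the 'Dirty:' value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1])
    raise ValueError("no Dirty: line found")


def timed_rewrite(path, size):
    """Open `path` with truncation, write `size` bytes, and time only close()."""
    f = open(path, "w")
    f.write("1" * size)
    f.flush()  # hand everything to the kernel so close() has nothing left to write
    start = time.monotonic()
    f.close()
    return time.monotonic() - start


def demo(path="foo_demo", size=100_000_000):
    """Rewrite the same file twice: first as a new file, then as an existing one."""
    # `path` and `size` are hypothetical defaults, far smaller than in the question.
    for label in ("first run", "second run (file exists)"):
        seconds = timed_rewrite(path, size)
        with open("/proc/meminfo") as meminfo:
            dirty = dirty_kib(meminfo.read())
        print(f"{label}: close() took {seconds:.3f}s, Dirty: {dirty} kB")
```

Calling demo() in a directory where foo_demo does not exist yet should show the difference between the two cases, provided the effect reproduces on the filesystem in use. The timings depend on the disk and on how much data is written.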
Best Answer
That sounds like a reminder of the O_PONIES fiasco, which just recently had its 11th birthday.
Before ext4 came, ext3 had acquired a sort of a reputation for being stable in the face of power losses. It seldom broke, it seldom lost data from files. Then, ext4 added delayed allocation of data blocks, meaning that it didn't even try to write file data to disk immediately. Normally, that's not a problem as long as the data gets there at some point, and for temporary files, it might turn out that there was no need to write the data to disk at all.
But ext4 did write metadata changes, and recorded that something had changed with the file. Now, if the system crashed, the file was marked as truncated, but the writes after that weren't stored on disk (because no blocks were allocated for them). Hence, on ext4, you'd often see recently-modified files truncated to a zero length after a crash.
That, of course, was not exactly what most users wanted, but the argument was made that application programs that cared about their data so much should have called fsync(), and if they actually cared about renames, they should fsync() (or at least fdatasync()) the containing directory too. Next to no-one did that, though, partly because on ext3, an fsync() synced the whole disk, possibly including large amounts of unrelated data. (Or close enough to the whole disk that the difference doesn't matter anyway.)
Now, on one hand, you had ext3, which performed poorly with fsync(), and on the other, ext4, which required fsync() to not lose files. Not a nice situation, considering that most application programs would care to implement filesystem-specific behavior even less than the rigid dance with calling fsync() at just the right moments. Apparently it wasn't even easy to figure out if a filesystem was mounted as ext3 or ext4 in the first place.
In the end, the ext4 developers made some changes to the most common critical-seeming cases: replacing a file by renaming another one over it, and rewriting an existing file after truncating it, which is the case in the question. In both cases ext4 forces the delayed blocks to be allocated right away (the auto_da_alloc mount option controls this), which would explain why close() takes so long when "foo" already exists.
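For a program that would rather not depend on filesystem-specific behavior, the fsync() dance described above might look roughly like this in Python. This is only a sketch: the function name durable_replace and the ".tmp" suffix are invented for the example, and it assumes a Linux system where a directory can be opened read-only and passed to fsync().

```python
import os


def durable_replace(path, data):
    """Replace `path` with `data` via a temporary file, fsync() and rename."""
    tmp_path = path + ".tmp"  # hypothetical temporary name in the same directory
    with open(tmp_path, "w") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file data to disk
    os.replace(tmp_path, path)  # rename the new file over the old one

    # fsync() the containing directory so that the rename itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Writing to a temporary file first means the old content is never truncated in place, so a crash should leave either the old file or the new one rather than an empty file. The price is that the call blocks until the data has reached the disk.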
As far as I can remember, XFS also exhibited similar zero-length files after a crash even before ext4. I never followed that, though, so I don't know what sorts of fixes they'd have done.
See, e.g. this article on LWN, which mentions the fixes: ext4 and data loss (March 2009)
There were other writings about that at the time, of course, but I'm not sure it's useful to link to them, as it's mostly a question of pointing fingers.