Which filesystems require fsync() for crash-safety when replacing an existing file with rename()


Following widespread complaints, ext4 gained a crash-safety guarantee called auto_da_alloc which is enabled by default. What about other filesystems? Out of the most well-known filesystems, which of them provide this same guarantee (and which of them do not)?

Personally I am interested in hearing information about

  • XFS – Red Hat Enterprise Linux default filesystem.
  • btrfs – SUSE Linux Enterprise default filesystem.
  • bcachefs – out-of-tree Linux filesystem, derived from bcache. "The COW filesystem for Linux that won't eat your data."

This issue mostly concerns Linux, as per the history below. It would be interesting to know how ZFS behaves as well, but I tend to assume it wouldn't implement this.

What is auto_da_alloc?

fsync() is well-documented as the correct way to write file data e.g. when you hit "save" in a text editor. And it is widely understood that e.g. text editors must replace existing files atomically using rename(). This is meant to protect against loss of power, making sure that you always either keep the old file, or get the new file (which was fsync()ed before the rename). You don't want to be left with only a half-written version of the new file.
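A minimal sketch of the pattern described above, assuming Linux and Python 3. The function name replace_file_durably is hypothetical and only for illustration. It writes the new content to a temporary file in the same directory, calls fsync() on it, renames it over the target, and then calls fsync() on the parent directory so that the rename itself reaches storage.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) using write + fsync() + rename().

    Hypothetical helper for illustration. After a crash, `path` should hold
    either the complete old content or the complete new content.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target,
    # because rename() cannot move a file across filesystems.
    # Note: mkstemp() creates the file with mode 0600; this sketch does not
    # copy the permissions or ownership of the old file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # Python's buffer -> kernel page cache
            os.fsync(f.fileno())  # kernel page cache -> storage device
        os.replace(tmp_path, path)  # rename() over the existing file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync() the directory so the new directory entry is durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Usage would look like replace_file_durably("notes.txt", b"new contents"). The fsync() on the file is the step that applications skipped on ext3, and it is the step that auto_da_alloc tries to compensate for.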

The problem was that calling fsync() on ext3, then the most popular Linux filesystem, could effectively leave the whole system hanging for tens of seconds. Since applications could do nothing about this, it became very common to optimistically use rename() without fsync(). That pattern seemed to work rather well on ext3, even if the system lost power.

Therefore, applications exist which do not use fsync() correctly.

The next version of the filesystem, ext4, generally avoided the fsync() hang. At the same time, it started relying much more on the correct use of fsync().

This is all pretty bad. Understanding this history is arguably not helped by the dismissive phrases used by many kernel developers on both sides of the argument.

This was resolved in ext4 by supporting the rename() pattern without requiring fsync() for crash-safety, so that it behaves in a crash as the old ext3 filesystem did. This behaviour can be disabled again if you mount with the option noauto_da_alloc.
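To see whether a given ext4 mount has had this behaviour switched off, one can look at its mount options. The following is a rough sketch, assuming Linux and assuming that noauto_da_alloc shows up in the options field of /proc/mounts when it is set. The function name is hypothetical.

```python
def ext4_mounts_without_auto_da_alloc(mounts_text):
    """Map each ext4 mount point to True if noauto_da_alloc is set.

    `mounts_text` is text in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    Assumption: the option is listed there whenever it is in effect.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue  # skip malformed lines and other filesystems
        mount_point = fields[1]
        options = fields[3].split(",")
        result[mount_point] = "noauto_da_alloc" in options
    return result


# Example use on a live system (not run here):
#     with open("/proc/mounts") as f:
#         print(ext4_mounts_without_auto_da_alloc(f.read()))
```

A value of False only means the option was not listed. It says nothing about whether the filesystem makes rename() fully atomic.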

Best Answer

There is a mistake in this question: it implies that auto_da_alloc makes this scenario fully crash-safe. That is not true for ext4, and I presume it was not true for the old ext3 either. It is true for btrfs, however, and apparently for bcachefs.

"Recent ext4 does have a special workaround to reduce the chance of replace-via-rename producing zero-length files by forcing out the new data blocks upon rename. However, rename does not wait for this flush to complete, and therefore provides no atomicity guarantee—it is possible to end up with only partial new content after a crash. Of the file systems we tested, btrfs is the only one that provides the replace-via-rename atomicity guarantee."

https://homes.cs.washington.edu/~lijl/papers/ferrite-asplos16.pdf


On btrfs, the documentation says that replacing a file using rename() provides full atomicity, and that it does not need an explicit fsync() to protect the data against crashes. I think this was added around the same time as ext4's auto_da_alloc. There is also a claim that the btrfs implementation avoids degrading performance, because it does not make the rename() call wait. However, I notice that in recent kernels, at least if you did call fsync() on the file, the following rename() will fsync() the parent directory and wait for the whole "log tree" to be written.

bcachefs currently appears to provide the full level of protection, though I did not find any documentation for this. Checking the code, I see a call to the function "filemap_write_and_wait_range" in the rename path.

XFS has rejected adding crash-safety workarounds for rename(). It apparently gained code that reduces (but does not remove) the risk of data loss in a different case.

UBIFS (used e.g. on many OpenWrt devices) does not include any crash-safety workarounds for rename(). Such a workaround could be accepted, but implementing it would require a lot of work. See http://www.linux-mtd.infradead.org/doc/ubifs.html#L_sync_exceptions