With full data journaling, why does data appear in the directory immediately

Tags: ext3, filesystems, journaling

I've got a question regarding full data journaling on ext3 filesystems. The man page states the following:

data=journal
All data is committed into the journal prior to being written into
the main filesystem.

It seems to me that this means a file is first saved to the journal and then copied to the filesystem.

I assumed that if I download something, it should first be saved in the journal and, once complete, moved to the filesystem. But right after the download starts, the file already appears in the directory (the FS). What's wrong with my understanding?

Edit: Maybe it's wrong to think of "all data" as the whole file? If "all data" means only a block or so at a time, that would make sense, and it would explain why I can't see things being written to the journal first.

Best Answer

First, you're right to suspect that “all data” doesn't mean the whole file. In fact, that layer of the filesystem operates on fixed-size file blocks, not on whole files. At that level, it's important to work with a bounded amount of data at a time, so working on whole files (which can be arbitrarily large) wouldn't be feasible.

Second, there's a misconception in your question. The journaling behavior isn't something you can observe by looking at the directory contents with ls; it works at a much lower level. With normal tools, you'll always see that the file is there. (It would be catastrophic if creating a file didn't appear to, y'know, create it.) What happens under the hood is that the file's data can be stored in different places over time. At first, the most recently written blocks sit in the journal. Then, as soon as it can be done efficiently, the data is moved to its final location. It's still the same file in the same directory, just stored differently on disk.
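To make the "journal first, final location second" idea concrete, here is a toy sketch of a write-ahead journal in Python. This is only an illustration of the principle; it is not ext3's actual on-disk journal format, and the file names (`journal.log`, the target path) are made up for the example:

```python
import os

BLOCK_SIZE = 4096  # the journal works on fixed-size blocks, not whole files

def journaled_write(journal_path, target_path, offset, block):
    """Toy write-ahead journaling: commit a block to the journal,
    then write it to its final location."""
    assert len(block) <= BLOCK_SIZE
    # 1. Commit the block to the journal and force it to disk first.
    with open(journal_path, "ab") as j:
        j.write(len(block).to_bytes(4, "big"))   # record length
        j.write(offset.to_bytes(8, "big"))       # record target offset
        j.write(block)
        j.flush()
        os.fsync(j.fileno())  # the block is now safely in the journal
    # 2. Only then write the block to its final location.
    mode = "r+b" if os.path.exists(target_path) else "wb"
    with open(target_path, mode) as f:
        f.seek(offset)
        f.write(block)
    # If the system crashed between steps 1 and 2, replaying the journal
    # would redo the write, so the file never ends up with a torn block.
```

The point the answer makes holds here too: a reader of `target_path` never sees the journal at all; the journal only matters for crash recovery.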

The only way you can observe journaling behavior is if you go and see exactly what the kernel is writing to the disk, or if you analyse the disk content after a crash. In normal operation, the journal is an implementation detail: if you could see it in action (other than performance-wise), it would be severely broken.

For more information about filesystem journals, I recommend starting with the Wikipedia article. In ext3 terms, data=journal ensures that if the system crashes, each file is in a state it had at some point before the crash (not always the latest state, because of buffering). The reason this doesn't happen automatically is that the kernel reorders disk writes for efficiency (which can make a big difference). This is called a “physical journal” in the Wikipedia article. The other two modes (data=ordered and data=writeback) are forms of “logical journal”: they're faster, but they can leave corrupted files after a crash. The journal limits the risk of corruption to a few files containing garbage; ext3 always uses a full journal for metadata. Without a journal for metadata, metadata can get lost, leading to major filesystem corruption. Furthermore, without a journal, recovery after a crash requires a full filesystem integrity check, whereas with a journal, recovery means replaying a few journal entries.
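For reference, the journaling mode is selected at mount time with the data= option. A typical /etc/fstab entry might look like this (the device name and mount point are placeholders; adjust them for your system):

```
# /etc/fstab — device and mount point are examples only
/dev/sda2   /home   ext3   data=journal   0   2
```

The same option can be passed on the command line, e.g. mount -o data=journal; see mount(8) for details.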

Note that even with a journal, typical unix filesystems don't guarantee global filesystem consistency, only per-file consistency at most. That is, suppose you write to file foo, then you write to file bar, then the system crashes. It's possible for bar to have the new contents but foo to still have the old contents. To have complete consistency, you need a transactional filesystem.
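If an application does need the "foo before bar" ordering described above, it has to enforce it itself by forcing foo to disk before touching bar. A minimal sketch in Python (file names are illustrative):

```python
import os

def ordered_writes(foo_path, bar_path, foo_data, bar_data):
    """Guarantee that foo's new contents reach the disk before bar is written.
    Without the fsync, the kernel may reorder the writes, so a crash could
    leave bar updated while foo still has its old contents."""
    with open(foo_path, "wb") as f:
        f.write(foo_data)
        f.flush()
        os.fsync(f.fileno())  # foo is durable before we touch bar
    with open(bar_path, "wb") as f:
        f.write(bar_data)
        f.flush()
        os.fsync(f.fileno())
```

With this, after a crash you may see neither file updated, or only foo, but never bar without foo.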
