Linux – How does Linux handle block devices

block-device, cache, linux

Today I learned that FreeBSD removed support for block devices entirely. While reading their rationale for the decision, I came across this:

Block devices are disk devices for which the kernel provides caching. This caching makes block-devices almost unusable, or at least dangerously unreliable. The caching will reorder the sequence of write operations, depriving the application of the ability to know the exact disk contents at any one instant in time. This makes predictable and reliable crash recovery of on-disk data structures (filesystems, databases etc.) impossible. Since writes may be delayed, there is no way the kernel can report to the application which particular write operation encountered a write error, this further compounds the consistency problem.

(From https://www.freebsd.org/doc/en_US.ISO8859-1/books/arch-handbook/driverbasics-block.html)

However, I know that Linux almost exclusively uses block devices (though one CAN request a raw one).

How then does Linux circumvent the issues mentioned in this quote? Or do most drivers just request a raw device instead?

Best Answer

BSD people are really hardcore and often do surprising things :-) Removing the block device layer is, in my opinion, not a problem (NFS, for example, doesn't have an underlying block device at all), but this reasoning is not really directed against block devices; it is directed against write caching. And removing the write cache would, in my opinion, be a very bad thing: if your process writes something to the disk, should it really not get control back until the write has actually completed?

But I don't think they did this without knowing what they were doing. Hopefully somebody will explain their reasoning in another answer.

To explain this clearly, I need to explain how filesystems work. A filesystem driver is essentially a translation layer between filesystem operations (directory opens, file creation, reads and writes, deletion, etc.) and block operations (for example: "write out page 0xfce2ea31 to disk block 0xc0deebed").
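As an illustration, here is a minimal C sketch of that translation; every name and structure in it is made up and vastly simplified (a real block map involves indirect blocks, extents and so on):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, drastically simplified inode: maps a page index
 * within the file to a block number on the disk. */
struct inode {
    uint64_t block_map[16];
};

/* Stand-in for the block layer; in a real kernel this would queue
 * the page in the block cache, not touch the disk directly. */
static void submit_block_write(uint64_t disk_block, const void *page)
{
    (void)page; /* a real implementation would queue the page itself */
    printf("block op: write one page to disk block %#llx\n",
           (unsigned long long)disk_block);
}

/* The filesystem operation "write page N of this file", translated
 * into the block operation "write disk block M". */
static void fs_write_page(struct inode *ino, unsigned page_index,
                          const void *page)
{
    submit_block_write(ino->block_map[page_index], page);
}

int main(void)
{
    struct inode ino = { .block_map = { [3] = 0xc0deebed } };
    char page[4096] = "hello";
    fs_write_page(&ino, 3, page); /* -> "write disk block 0xc0deebed" */
    return 0;
}
```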

But the block operations don't reach the hard disk immediately. First they go to the block cache. This means that if the filesystem wants to write a memory page to the disk, it first writes it into a reserved memory area, and the kernel's memory management writes that data out to the hard disk whenever it considers it optimal. This enables various speed improvements: for example, if many write operations happen at the beginning and at the end of the disk, the kernel can combine them in such a way that the disk head has to reposition itself as seldom as possible.
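You can see this split from user space, too: a plain write() normally returns as soon as the data has landed in the cache, and only an explicit fsync() forces it out to the disk. A minimal sketch:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Returns quickly: the data usually only lands in the cache. */
    if (write(fd, "hello\n", 6) != 6) { perror("write"); return 1; }

    /* Blocks until the data has actually reached the disk. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```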

There is another improvement: if your program writes into a file, the operation feels as fast as if it were writing to a ramdisk. Of course, this is only possible as long as the system's RAM isn't full; after that, writes must wait for the write cache to drain. But that happens only when there are a lot of write operations at once (for example, when you are copying large files).
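You can watch this write cache at work: the kernel reports how much dirty data is waiting to be flushed in /proc/meminfo. A small sketch that prints the relevant lines:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[128];
    while (fgets(line, sizeof line, f))
        /* "Dirty" = written data still waiting in the cache;
         * "Writeback" = data currently being written to disk. */
        if (!strncmp(line, "Dirty:", 6) || !strncmp(line, "Writeback:", 10))
            fputs(line, stdout);

    fclose(f);
    return 0;
}
```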

There is also a big difference between filesystems that run on a disk (i.e. on block devices) and those that don't (e.g. NFS). In the second case, block caching is impossible, because there are no blocks. For those filesystems there is a so-called "buffer cache", which still means caching (both read and write), but it is organized around I/O fragments of arbitrary size rather than around memory blocks.

Yes, in Linux there are "raw" block devices, which allow disk devices to be used without the block caching mechanism. But they are not what solves this problem.
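For completeness, here is a hedged sketch of one way to bypass the cache from user space, using O_DIRECT (the old /dev/raw interface is another). /dev/sdX is just a placeholder, the required alignment depends on the device, and writing to a real disk device destroys data:

```c
#define _GNU_SOURCE /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder device name; this needs root and wipes real data! */
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer, offset and length to be aligned,
     * typically to the device's logical block size. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return 1;
    }
    memset(buf, 0, 4096);

    /* Goes (more or less) straight to the device, skipping the cache. */
    if (write(fd, buf, 4096) != 4096) perror("write");

    free(buf);
    close(fd);
    return 0;
}
```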

Instead, there are the so-called "journaling filesystems". A journaling filesystem can instruct the kernel which pages must be written out before which others. A filesystem without a journaling mechanism simply writes its blocks "to the disk" (more precisely, to the block cache), and the kernel performs the real write operation whenever it considers it optimal.

You can imagine a journaling filesystem as if every write operation happened twice: first into a "journal", which is a reserved area on the disk, and only after that to its real location. After a system crash or disk error, the last undamaged state of the disk can be reconstructed from the journal quickly and easily.
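As a toy illustration, here is the same write-ahead idea in user space; the file names and the "protocol" are invented, and a real journal additionally needs commit markers, checksums, transaction grouping and so on:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int journaled_write(int journal_fd, int data_fd,
                           off_t offset, const char *buf, size_t len)
{
    /* 1. Write the intended change into the journal first... */
    if (pwrite(journal_fd, buf, len, 0) != (ssize_t)len) return -1;
    /* ...and make sure it really is on disk before going further. */
    if (fsync(journal_fd) < 0) return -1;

    /* 2. Only now write the data to its real location. If we crash
     * here, recovery can replay the copy kept in the journal. */
    if (pwrite(data_fd, buf, len, offset) != (ssize_t)len) return -1;
    if (fsync(data_fd) < 0) return -1;

    /* 3. The journal entry is now obsolete and may be reused. */
    return 0;
}

int main(void)
{
    int jfd = open("journal", O_RDWR | O_CREAT, 0644);
    int dfd = open("data",    O_RDWR | O_CREAT, 0644);
    if (jfd < 0 || dfd < 0) { perror("open"); return 1; }

    if (journaled_write(jfd, dfd, 0, "hello\n", 6) < 0)
        perror("journaled_write");

    close(jfd);
    close(dfd);
    return 0;
}
```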

Done naively, this significantly decreases write performance, because everything must be written twice. This is why real journaling filesystems work in a much more complex way, using various intricate data-structure manipulations to reduce this overhead to a nearly invisible level. But that is hard: the major improvement of ext3 over ext2, for example, was the addition of journaling, and it multiplied the code size.

In Linux, the block layer API has a "barrier" mechanism: filesystems can place "barriers" between their write operations. A barrier means that data after the barrier will be written to the disk only after all data before the barrier has already been written out. Journaling filesystems use the barrier mechanism to tell the block layer the required ordering of the actual write operations. As far as I know, they don't use the raw device mapping.
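Applications never see these barriers directly (and in modern kernels the barrier implementation itself was replaced by explicit cache-flush requests, though the ordering idea is the same). The nearest user-space analogue is an fdatasync() placed between two writes, which guarantees the first has reached the disk before the second is issued. A sketch:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("ordered", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Without an ordering point, the kernel may write A and B
     * to the disk in either order. */
    if (write(fd, "A", 1) != 1) { perror("write"); return 1; }

    /* The "barrier": A is guaranteed to be on disk past this point. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    /* B can now only reach the disk after A. */
    if (write(fd, "B", 1) != 1) { perror("write"); return 1; }

    close(fd);
    return 0;
}
```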

I don't know what FreeBSD does in this case. Maybe their elimination of block devices just means that everything goes through the buffer cache rather than the block cache. Or they have something that isn't described here. In filesystem internals, there are very big differences between the *BSD and Linux worlds.