BSD people are really hardcore and often do surprising things :-) Removing the block device layer is, in my opinion, not a problem (for example, NFS doesn't have an underlying block device either), but this reasoning is not really against block devices, it is against write caching. And removing the write cache is, in my opinion, a very bad thing. If your process writes something to the disk, does it really not get control back until the write has succeeded?
But I don't think they didn't know what they were doing. Hopefully somebody will explain their reasons in another answer.
To explain this clearly, I need to explain how filesystems work. A filesystem driver is essentially a translation layer between filesystem operations (directory open, file creation, read/write, deletion, etc.) and block operations (for example: "write out the page 0xfce2ea31 to the disk block 0xc0deebed").
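To make the translation concrete, here is a minimal sketch (in C, with made-up values and an assumed 4096-byte filesystem block size) of how a byte offset in a file maps to a filesystem block:

```c
/* A minimal sketch of the translation idea, not any real driver's code;
   the values and the block size are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define FS_BLOCK_SIZE 4096  /* assumed filesystem block size */

int main(void)
{
    uint64_t offset   = 10000;                   /* byte offset in a file */
    uint64_t block    = offset / FS_BLOCK_SIZE;  /* which fs block        */
    uint64_t in_block = offset % FS_BLOCK_SIZE;  /* position inside it    */

    printf("offset %llu -> block %llu, byte %llu within that block\n",
           (unsigned long long)offset,
           (unsigned long long)block,
           (unsigned long long)in_block);
    return 0;
}
```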
But the block operations don't reach the hard disk immediately. First, they go to the block cache. That means, if the filesystem wants to write a memory page to the disk, it first writes it into a reserved memory area. The kernel's memory management writes this data out to the hard disk when it considers that optimal. This enables various speed improvements: for example, if many write operations happen at the beginning and at the end of the disk, the kernel can combine them in such a way that the disk head has to reposition itself as seldom as possible.
There is another improvement: if your program writes into a file, the operation appears as fast as if it were writing to a ramdisk. Of course, this only works until the system's RAM fills up; after that, writes must wait for the write cache to be emptied. But that only happens when there are a lot of write operations at once (for example, when you are copying large files).
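Here is a minimal sketch of this behavior on a Linux-like system (the file name is just an example): write() normally returns as soon as the data is in the write cache, while fsync() blocks until the data has really been written out:

```c
/* A minimal sketch, assuming a Linux-like system; the file name is an
   example. write() returns once the page is in the cache; fsync() is
   the explicit request to push it to the disk. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/cache-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'x', sizeof buf);

    /* Returns quickly: the data lands in the block (page) cache. */
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        perror("write");

    /* Blocks until the kernel has actually written the data out. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}
```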
There is a big difference between filesystems that run on a disk (i.e., block devices) and those that don't (e.g., NFS). For the latter, block caching is impossible, because there are no blocks. In their case there is a so-called "buffer cache", which is still caching (both read and write), but it is organized not around memory blocks but around I/O fragments of any size.
Yes, in Linux there are "raw" block devices, which allow disk devices to be used without the block caching mechanism. But they don't solve this problem.
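For comparison, here is a minimal sketch of the closest modern Linux equivalent: opening a device with O_DIRECT, which bypasses the block cache. The device path is just an example, and O_DIRECT requires the buffer, offset and length to be aligned to the device block size:

```c
/* A minimal sketch of cache-bypassing I/O on Linux via O_DIRECT
   (the modern replacement for the old raw devices). The device path
   is an example; O_DIRECT needs buffer/offset/length aligned to the
   device block size, hence posix_memalign(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    /* This read goes to the device, not to the block cache. */
    ssize_t n = pread(fd, buf, 4096, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes directly from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}
```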
Instead, there are the so-called "journaling filesystems". With a journaling filesystem, the filesystem can instruct the kernel which pages must be written out before others. Without a journaling mechanism, a filesystem just writes blocks to the disk (more precisely, to the block cache), and the kernel performs the real write operation whenever it considers that optimal.
You can imagine a journaling filesystem as if every write operation happened twice: first into a "journal", which is a reserved area on the disk, and only after that to its real location. In case of a system crash or disk error, the last undamaged state of the disk can be reconstructed quickly and easily from the journal.
But doing every write twice would significantly decrease write performance. This is why real journaling filesystems work in a much more complex way: they use various intricate data structure manipulations to reduce this overhead to a nearly invisible level. But this is hard: for example, the major improvement of ext3 over ext2 was the addition of journaling, which multiplied its code size.
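To make the write-twice idea concrete, here is a deliberately naive sketch of the ordering (journal first, then the real location). Real filesystems batch many updates into transactions; all names and parameters here are made up for illustration:

```c
/* A deliberately naive sketch of the journal write order described
   above: journal first, flush, then the real location, then flush
   again. Real filesystems batch many updates into transactions;
   all names and offsets here are hypothetical. */
#include <sys/types.h>
#include <unistd.h>

/* journal_fd/data_fd stand in for the journal area and the data area. */
int journaled_write(int journal_fd, off_t journal_off,
                    int data_fd, off_t data_off,
                    const void *buf, size_t len)
{
    /* 1. Write the data (destined for data_off) into the journal. */
    if (pwrite(journal_fd, buf, len, journal_off) != (ssize_t)len)
        return -1;
    if (fdatasync(journal_fd) < 0)   /* journal entry must be durable */
        return -1;

    /* 2. Only now write the data to its real location. */
    if (pwrite(data_fd, buf, len, data_off) != (ssize_t)len)
        return -1;
    if (fdatasync(data_fd) < 0)
        return -1;

    /* 3. After this, the journal entry could be marked as completed. */
    return 0;
}
```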
In Linux, the block layer API has a "barrier" mechanism. Filesystems can set up "barriers" between their write operations. A barrier means that data after the barrier will be written to the disk only after all the data before the barrier has been written out. Journaling filesystems use the barrier mechanism to tell the block layer the required ordering of the actual write operations. As far as I know, they don't use the raw devices.
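Here is a toy illustration of the barrier semantics (not kernel code, just made-up values): writes on either side of a barrier may be reordered among themselves, but nothing after the barrier may be issued before everything preceding it is done:

```c
/* A toy illustration of barrier semantics, not kernel code: the block
   layer may reorder writes freely, except that nothing after a barrier
   may be issued before everything before the barrier has completed. */
#include <stdio.h>

enum op_type { WRITE_OP, BARRIER_OP };
struct op { enum op_type type; int block; };

int main(void)
{
    /* Blocks 10 and 2 may be reordered with each other, and so may
       7 and 5, but the second group must hit the disk strictly after
       the first group, because of the barrier between them. */
    struct op queue[] = {
        { WRITE_OP, 10 }, { WRITE_OP, 2 },
        { BARRIER_OP, 0 },
        { WRITE_OP, 7 }, { WRITE_OP, 5 },
    };

    for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++) {
        if (queue[i].type == BARRIER_OP)
            printf("-- barrier: wait for all previous writes --\n");
        else
            printf("write block %d\n", queue[i].block);
    }
    return 0;
}
```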
I don't know what FreeBSD does in this case. Maybe their elimination of block devices only means that everything goes through the buffer cache instead of the block cache. Or they have something that isn't described here. In filesystem internals, there are very big differences between the *BSD and Linux worlds.
The device block size is the block size in which the system talks to the HDD controller. Reading or writing the HDD happens like this:
Read:
- CPU -> HDD controller: "Please send me the data of block 43623626"
- HDD controller -> CPU: "Done, here it is: 0xfce2c0deebed..."
Write:
- CPU -> HDD controller: "Please write this data to block 3452345: 0xfce2c0deebed..."
- HDD controller -> CPU: "done"
Here the block number is the index of a 512-byte block: for example, block 3452345 means the 3452345th 512-byte unit on the disk.
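From user space, the read case looks like this minimal sketch (assuming a Linux-like system; the device path is an example): a block device is one big byte array, so "read block N" is just a read at offset N * 512:

```c
/* A minimal sketch of the read dialog above, from user space on a
   Linux-like system. The device path is an example; run as root. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char block[512];
    off_t blockno = 43623626;   /* the block number from the dialog above */

    /* "Please send me the data of block 43623626" */
    if (pread(fd, block, sizeof block, blockno * (off_t)sizeof block) < 0)
        perror("pread");

    close(fd);
    return 0;
}
```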
Theoretically, any block size would be possible. Most devices use 512-byte blocks; some of them, particularly large HDDs, use 4096-byte blocks. Optical media typically use 2048-byte blocks.
The important thing is that the block device controller doesn't know anything about the filesystem on it. It can only read and write blocks, in its own block size, to its medium. This is what the block device driver uses to present the block device to the kernel: essentially a single, large byte array. It doesn't matter how it is partitioned or which filesystem uses it.
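On Linux you can ask a device for its block sizes with the BLKSSZGET/BLKPBSZGET ioctls; a minimal sketch (the device path is an example, and it needs root):

```c
/* A minimal sketch, Linux-specific: querying the logical and physical
   block sizes of a device with the BLKSSZGET/BLKPBSZGET ioctls from
   <linux/fs.h>. The device path is an example; run as root. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0, physical = 0;
    if (ioctl(fd, BLKSSZGET, &logical) == 0 &&
        ioctl(fd, BLKPBSZGET, &physical) == 0)
        printf("logical: %d bytes, physical: %d bytes\n", logical, physical);
    else
        perror("ioctl");

    close(fd);
    return 0;
}
```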
The filesystem block size is the block size in which the filesystem's data structures are organized. It is an internal feature of the filesystem; there isn't even a requirement to use block-oriented data structures, and some filesystems don't.
Ext4 most typically uses 4096-byte blocks.
Furthermore, disk I/O data is typically handled not directly by the processes but through the virtual memory of your OS, which makes extensive use of paging. The VM page size is typically 4096 bytes; it is determined by the CPU architecture and may differ on non-x86 CPUs (for example, newer amd64 CPUs can also handle 2 MB pages, and the DEC Alpha used 8192-byte pages).
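A minimal sketch of querying both values, the filesystem block size (statvfs) and the VM page size (sysconf), for a given path (the path is just an example):

```c
/* A minimal sketch: querying the filesystem block size (statvfs) and
   the VM page size (sysconf). The path "/" is an example. */
#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

int main(void)
{
    struct statvfs sv;
    if (statvfs("/", &sv) == 0)
        printf("filesystem block size: %lu bytes\n",
               (unsigned long)sv.f_bsize);

    printf("VM page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}
```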
To optimize data I/O, it is best if all of these sizes are multiples of each other, and better yet if they are equal. In practice this typically means: use 4096-byte filesystem blocks.
It is also important that if your block device is partitioned, the partitions should begin and end on exact page boundaries. If they don't (for example, if your sda1 starts at the 17th block of sda), the CPU will have to issue TWO read/write commands for every page read/write operation, because the filesystem blocks will straddle physical block boundaries.
In the most common scenario, this means: every partition should start on a sector number divisible by 8 (4096 / 512 = 8).
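The check itself is simple arithmetic; a minimal sketch with example values:

```c
/* A minimal sketch of the alignment check above: a partition start
   sector is page-aligned if it is divisible by page_size / sector_size.
   The start sector value is an example. */
#include <stdio.h>

int main(void)
{
    unsigned long long start_sector = 2048;   /* example start sector */
    unsigned long page_size = 4096, sector_size = 512;
    unsigned long sectors_per_page = page_size / sector_size;   /* = 8 */

    if (start_sector % sectors_per_page == 0)
        printf("partition start is page-aligned\n");
    else
        printf("misaligned: every page I/O straddles a block boundary\n");
    return 0;
}
```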
Note that low-level block I/O typically doesn't happen in single-block read/write operations; instead, multiple blocks are sent/received in a single command. Also, reorganizing data is typically not a big overhead, because memory I/O is much faster than block device I/O. Thus, not following these guidelines won't cost you very much.