Good block size for file cache on Linux

block-device cache filesystems

The device block size is usually 512 bytes while the filesystem block size is often 4096 bytes. Why are they different? Why are 512B and 4KB good choices for device and filesystem block sizes? What block size would work best for caching disk reads in a userspace library?

Best Answer

The device block size is the unit in which the system talks to the HDD controller. A read or a write works like this:

  1. Read:

    1. CPU -> HDD controller: "Please send me the data of block 43623626"
    2. HDD controller -> CPU: "Done, here it is: 0xfce2c0deebed..."
  2. Write:

    1. CPU -> HDD controller: "Please write this data to block 3452345: 0xfce2c0deebed..."
    2. HDD controller -> CPU: "done"

Here the block number is simply the index of a 512-byte block on the device: block N holds bytes N × 512 through N × 512 + 511.
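The block-number arithmetic above can be sketched in a few lines of Python. This is only an illustration: `block_range`, `read_block`, and `fake_device` are made-up names, and an in-memory buffer stands in for a real device.

```python
import io

BLOCK_SIZE = 512  # assumed device block size in bytes


def block_range(block_number, block_size=BLOCK_SIZE):
    """Return the (first, last) byte offsets covered by a block."""
    start = block_number * block_size
    return start, start + block_size - 1


def read_block(device, block_number, block_size=BLOCK_SIZE):
    """Read one whole block from a seekable file-like object."""
    device.seek(block_number * block_size)
    return device.read(block_size)


# Stand-in for a device: four blocks, each filled with its own index.
fake_device = io.BytesIO(b"".join(bytes([i]) * BLOCK_SIZE for i in range(4)))

print(block_range(2))                  # (1024, 1535)
print(read_block(fake_device, 2)[:4])  # b'\x02\x02\x02\x02'
```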

In theory any block size could be used. Most devices use 512-byte blocks, and some, particularly large HDDs, use 4096-byte blocks. Optical media such as CD-ROMs and DVDs typically use 2048-byte blocks.
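On Linux, the block sizes a device reports can be read from sysfs. The following is a minimal sketch, assuming the usual `/sys/block/<device>/queue/` layout; the helper name `device_block_sizes` and its `sysfs_root` parameter are my own, added so the function can be pointed at a different directory.

```python
from pathlib import Path


def device_block_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) block sizes in bytes for a block device.

    `device` is a kernel name such as "sda". Linux exposes both values
    as small text files under <sysfs_root>/<device>/queue/.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical
```

Calling `device_block_sizes("sda")` on a machine that has an `sda` disk returns the two sizes as a tuple. A drive that uses 4096-byte blocks internally but presents 512-byte blocks to the host would report a smaller logical than physical size.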

The important point is that the block device controller knows nothing about the filesystem stored on it. It can only read and write whole blocks, in its own block size, on its medium. The block device driver builds on this to present the device to the kernel as essentially a single, large byte array. How it is partitioned, or which filesystem uses it, does not matter at this level.

The filesystem block size is the unit in which the filesystem organizes its own data structures. It is an internal property of the filesystem. There is not even a requirement to use block-oriented data structures, and some filesystems do not.

Ext4 most commonly uses 4096-byte blocks.

Furthermore, processes typically do not handle disk I/O data directly. It goes through the virtual memory subsystem of the OS, which relies heavily on paging. The VM page size is determined by the CPU architecture and is typically 4096 bytes, though it can differ on non-x86 CPUs. (For example, amd64 CPUs also support 2 MB pages, and the DEC Alpha used 8192-byte pages.)
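A userspace program can query both the filesystem block size and the VM page size at run time rather than hard-coding 4096. A small sketch using only the Python standard library on a Unix-like system; the wrapper names `fs_block_size` and `page_size` are illustrative, and the value some filesystems report in `f_bsize` is a preferred I/O size that may not match their on-disk block size.

```python
import mmap
import os


def fs_block_size(path="/"):
    """Block size reported by the filesystem that holds `path`."""
    return os.statvfs(path).f_bsize


def page_size():
    """Size of a virtual memory page in bytes."""
    return mmap.PAGESIZE


print("filesystem block size:", fs_block_size("/"))
print("VM page size:", page_size())
```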

To optimize I/O, these sizes should be multiples of each other, and ideally equal. In practice this usually means using 4096-byte filesystem blocks.
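For the userspace read cache asked about in the question, one way to apply this rule is to pick a cache block size that is a whole multiple of the device block, filesystem block, and page sizes, and to widen every read to block boundaries. This is a sketch under that assumption, not a tuned recommendation; `cache_block_size` and `aligned_span` are hypothetical helpers.

```python
import math


def cache_block_size(*sizes):
    """Smallest size that is a whole multiple of every given granularity."""
    result = 1
    for size in sizes:
        result = result * size // math.gcd(result, size)  # running LCM
    return result


def aligned_span(offset, length, block):
    """Expand a byte range to whole cache blocks.

    Returns (start, size) of the smallest block-aligned range that
    covers [offset, offset + length).
    """
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up
    return start, end - start


block = cache_block_size(512, 4096, 4096)
print(block)                           # 4096
print(aligned_span(5000, 100, block))  # (4096, 4096)
print(aligned_span(4000, 200, block))  # (0, 8192)
```

The last call shows the cost of a request that crosses a boundary: 200 bytes requested, two full blocks fetched. A real cache might use a larger multiple of this size to benefit from read-ahead, which is a separate tuning question.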

Alignment matters too: if your block device is partitioned, the partitions should begin and end on page-size boundaries. If they do not, for example if sda1 starts at the 17th block of sda, the system has to issue two read/write commands for every page read/write operation, because each filesystem block straddles two physical blocks.

In the most common scenario this means every partition should start and end on a sector number divisible by 8 (4096 / 512 = 8).
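The divisible-by-8 rule and the cost of breaking it can be checked with simple arithmetic. The sketch below assumes 512-byte sectors and 4096-byte blocks, as in the text; the function names are made up for illustration.

```python
SECTOR = 512     # assumed device sector size in bytes
FS_BLOCK = 4096  # assumed filesystem block and page size in bytes


def is_aligned(start_sector, sector=SECTOR, block=FS_BLOCK):
    """True if a partition starting at `start_sector` is block-aligned."""
    return (start_sector * sector) % block == 0


def physical_blocks_touched(start_sector, fs_block_index,
                            sector=SECTOR, block=FS_BLOCK):
    """Count the block-sized, block-aligned device regions that one
    filesystem block overlaps, given the partition's start sector."""
    first_byte = start_sector * sector + fs_block_index * block
    last_byte = first_byte + block - 1
    return last_byte // block - first_byte // block + 1


print(is_aligned(2048), physical_blocks_touched(2048, 0))  # True 1
print(is_aligned(17), physical_blocks_touched(17, 0))      # False 2
```

With a start sector of 17, every filesystem block overlaps two aligned 4096-byte regions of the device, which is the doubling described above.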

Note that low-level block I/O typically does not happen one block at a time. Multiple blocks are sent or received in a single command. Rearranging data in memory is usually cheap as well, because memory is much faster than block device I/O. So ignoring these guidelines does not cost much.
