Good block size for file cache on Linux

block-device cache filesystems

The device block size is usually 512 bytes while the filesystem block size is often 4096 bytes. Why are they different? Why are 512B and 4KB good choices for device and filesystem block sizes? What block size would work best for caching disk reads in a userspace library?

Best Answer

The device block size is the unit in which the system talks to the HDD controller. Reading from or writing to the HDD works like this:

Read:
1. CPU -> HDD controller: "Please send me the data of block 43623626"
2. HDD controller -> CPU: "Done, here it is: 0xfce2c0deebed..."
Write:
1. CPU -> HDD controller: "Please write this data to block 3452345: 0xfce2c0deebed..."
2. HDD controller -> CPU: "done"

Here the block number is simply an index: block 2354242, for example, means the 2354242nd 512-byte block on the device.
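To make the addressing concrete, here is a minimal sketch that converts a block number into the byte range it covers, assuming 512-byte blocks numbered from zero. The function name block_range is a hypothetical helper for illustration, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the first and last byte offset covered by a
# device block, given its number ($1) and the block size in bytes ($2).
block_range() {
  block=$1
  size=${2:-512}                 # assume 512-byte blocks unless told otherwise
  first=$(( block * size ))
  last=$(( first + size - 1 ))
  echo "$first $last"
}

block_range 2 512                # prints "1024 1535"
block_range 43623626 512         # prints "22335296512 22335297023"
```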

In theory, any block size could be used. Most devices use 512-byte blocks, and some, particularly large HDDs, use 4096-byte blocks. Optical media typically use 2048-byte blocks.

The important point is that the block device controller knows nothing about the filesystem stored on it. It can only read and write blocks, in its own block size, on its medium. The block device driver builds on this to present the device to the kernel as essentially a single, large byte array. It doesn't matter how the device is partitioned or which filesystem uses it.

The filesystem block size is the block size in which the filesystem organizes its data structures. It is an internal feature of the filesystem; there is not even a requirement to use block-oriented data structures, and some filesystems don't.

Ext4 most typically uses 4096-byte blocks.

Furthermore, disk I/O data is typically not handled directly by processes but goes through the virtual memory of the OS, which makes extensive use of paging. The VM page size is typically 4096 bytes; it is determined by the CPU architecture and may differ on non-x86 CPUs. (For example, amd64 CPUs can also handle 2MB pages, and the DEC Alpha used 8192-byte pages.)

To optimize data I/O, it is best if all of these sizes are multiples of each other, and better still if they are equal. This typically means: use 4096-byte filesystem blocks.
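The "multiples of each other" rule can be checked with a little arithmetic. The sketch below is only an illustration: is_multiple and sizes_compatible are hypothetical helper names, and the 512 and 4096 values are the typical sizes mentioned above, not values read from a real device. The page size is taken from getconf PAGESIZE.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the larger of two sizes is an exact
# multiple of the smaller one.
is_multiple() {
  if [ "$1" -ge "$2" ]; then
    [ $(( $1 % $2 )) -eq 0 ]
  else
    [ $(( $2 % $1 )) -eq 0 ]
  fi
}

# Hypothetical helper: succeed when the device block size ($1), filesystem
# block size ($2) and VM page size ($3) are all multiples of each other.
sizes_compatible() {
  is_multiple "$1" "$2" && is_multiple "$2" "$3" && is_multiple "$1" "$3"
}

page=$(getconf PAGESIZE)         # VM page size of the running system
if sizes_compatible 512 4096 "$page"; then
  echo "512 / 4096 / $page: compatible"
else
  echo "512 / 4096 / $page: not compatible"
fi
```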

It is also important that, if your block device is partitioned, the partitions begin and end on exact page-size boundaries. If they don't, for example if your sda1 starts on the 17th block of your sda, the CPU will have to issue TWO read/write commands for every page read/write operation, because the filesystem blocks will straddle the boundaries of the physical blocks.

In the most common scenario, this means that every partition should start and end on a sector number divisible by 8 (4096 / 512 = 8).
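The divisible-by-8 rule is easy to test for a given start sector. This is a minimal sketch assuming 512-byte sectors and 4096-byte pages by default; partition_aligned is a hypothetical helper name, and the sector numbers 17 and 2048 are example inputs only.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a partition's start sector falls on a
# page boundary. $1 = start sector, $2 = sector size in bytes (default 512),
# $3 = page / filesystem block size in bytes (default 4096).
partition_aligned() {
  start=$1 sector=${2:-512} page=${3:-4096}
  [ $(( start * sector % page )) -eq 0 ]
}

for s in 17 2048; do
  if partition_aligned "$s"; then
    echo "sector $s: aligned"
  else
    echo "sector $s: misaligned"
  fi
done
```

On Linux, the real start sector of a partition such as sda1 can be read from /sys/block/sda/sda1/start, which is expressed in 512-byte sectors.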

Note that low-level block I/O typically does not happen as single-block reads and writes; instead, multiple blocks are sent or received in a single command. Re-organizing data in memory is also usually cheap, because memory I/O is much faster than block device I/O. Thus, not following these rules won't cost much.
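For the userspace cache from the question, one way to apply this is to expand each read request to whole cache blocks. The sketch below assumes a 4096-byte cache block, matching the typical page and filesystem block size discussed above; that choice and the helper name align_read are assumptions made for illustration, not a measured recommendation.

```shell
#!/bin/sh
# Hypothetical helper: expand a read request (byte offset $1, length $2) to
# the enclosing range of whole cache blocks of size $3 (default 4096).
# Prints the aligned start offset and the aligned length.
align_read() {
  off=$1 len=$2 bs=${3:-4096}
  start=$(( off - off % bs ))                     # round start down
  end=$(( (off + len + bs - 1) / bs * bs ))       # round end up
  echo "$start $(( end - start ))"
}

align_read 5000 100              # prints "4096 4096"
align_read 4000 200              # prints "0 8192"
```

A request that crosses a block boundary, like the second one, turns into a read of two cache blocks, which the device can still serve with a single multi-block command.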

Related Solutions

Sparse Files – Understanding File Holes and Block Size

Ext4 can use 1kB, 2kB or 4kB as the block size; as far as I know the default on Ubuntu is 4kB. Note that here, a block is the unit in which file data is allocated, and its size is constant for a given filesystem. The file you describe has two blocks that are not all zeroes: the one containing hello (surrounded by a bunch of zeroes: 3616 before and 474 after), and the one containing there (preceded by a bunch of zeroes, and containing only 3148 bytes, after which the end of the file is reached). The total is two blocks of 4kB.

In the ls output, blocks are an arbitrary unit chosen by the ls command and defaulting to 1kB. There are 2 blocks of 4kB each allocated to contain file data, therefore the allocated size for the file is 8kB.

Your confusion may be due to two things. First, the figure of 2048 bytes for a block is possible, but it's not the default value under Ubuntu (or most modern distributions), and it's apparently not the value on your system. You can check the block size by running tune2fs -l /dev/sdz42 (use the actual path to your filesystem device).

Second, sparse files work by not storing blocks that consist entirely of zeroes. If a block (which is of necessity aligned on a block size boundary, at least for most filesystems including ext4) contains zeroes and other things, then the full block is stored on the disk. Thus, in that 40012-byte file (how did you get to 40013, by the way?), there are 4 all-zero non-stored blocks, then one stored block containing hello surrounded by zeroes, then 4 more all-zero non-stored blocks, and a final partial block containing zeroes and there.
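The block layout above can be reproduced with integer arithmetic. This sketch assumes 4096-byte blocks and that each word is written with a trailing newline (6 bytes), at offsets 20000 and 40006, which is what gives the 40012-byte file size; blocks_touched is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the index of every filesystem block touched by
# a data extent. $1 = block size, $2 = byte offset, $3 = length in bytes.
blocks_touched() {
  bs=$1 off=$2 len=$3
  first=$(( off / bs ))
  last=$(( (off + len - 1) / bs ))
  i=$first
  while [ "$i" -le "$last" ]; do
    echo "$i"
    i=$(( i + 1 ))
  done
}

blocks_touched 4096 20000 6          # "hello" plus newline: prints 4
blocks_touched 4096 40006 6          # "there" plus newline: prints 9
echo $(( (40012 + 4095) / 4096 ))    # blocks spanned by the whole file: prints 10
```

Blocks 0 to 3 and 5 to 8 hold no data and can stay unallocated, leaving 2 of the 10 blocks stored.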

Note that your utility can be written in terms of standard shell commands:

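n=20000
# For each input line, move n bytes ahead in the output (leaving a hole),
# then write the line itself.
while IFS= read -r line; do
  dd bs=1 seek=$n </dev/null   # copies nothing; only advances the output offset
  echo "$line"
done >testfile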

Stat, Blocks and Sector size – ext4

The links you give explicitly state:

The st_blocks field indicates the number of blocks allocated to the file, 512-byte units.

So they're always in units of 512-byte blocks, regardless of what underlying device is used. The stat command simply displays what the stat system call returns. The 512-byte block is a historic thing, defined in POSIX. Compare for example these:

$ ls -s smallfile.txt
4 smallfile.txt
$ env POSIXLY_CORRECT=1 ls -s smallfile.txt
8 smallfile.txt

GNU ls displays blocks by default in 1kB blocks, but when forced to comply with POSIX it shows 512-byte blocks.
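The two ls figures are the same allocation expressed in different units. A minimal sketch of the conversion follows; the helper names are hypothetical, and the KiB variant rounds up so that a partial kilobyte is not lost.

```shell
#!/bin/sh
# Hypothetical helpers: convert an st_blocks count (always 512-byte units)
# to bytes and to 1kB blocks (rounded up).
st_blocks_to_bytes() { echo $(( $1 * 512 )); }
st_blocks_to_kib()   { echo $(( ($1 * 512 + 1023) / 1024 )); }

st_blocks_to_bytes 8             # prints 4096
st_blocks_to_kib 8               # prints 4
```

With GNU stat, stat -c '%b %B' smallfile.txt prints the block count together with the size in bytes of the unit it is counted in.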
