Why is 67108864 the maximum bytes-per-inode ratio? Why is there a max

Tags: ext4, hard-disk, inode, mkfs

Formatting a disk for purely large video files, I calculated what I thought was an appropriate bytes-per-inode value, in order to maximise usable disk space.

I was greeted, however, with:

mkfs.ext4: invalid inode ratio [RATIO] (min 1024/max 67108864)

I assume the minimum is derived from what could even theoretically be used – no point having more inodes than could ever be utilised.

But where does the maximum come from? mkfs doesn't know the size of files I'll put on the filesystem it creates – so unless it was to be {disk size} - {1 inode size} I don't understand why we have a maximum at all, much less one as low as 67MB.

Best Answer

Because of the way the filesystem is built. It's a bit messy, and by default you can't even get the ratio as low as one inode per 64 MB.

From the Ext4 Disk Layout document on kernel.org, we see that the file system internals are tied to the block size (4 kB by default), which controls both the size of a block group, and the amount of inodes in a block group. A block group has a one-block sized bitmap of the blocks in the group, and a minimum of one block of inodes.

Because the bitmap must fit in a single block, the maximum block group size is 8 × (block size in bytes) blocks, so on an FS with 4 kB blocks, a block group is 32768 blocks, or 128 MB. The inode table takes at least one block, so with 4 kB blocks you get at least (4096 B/block) / (256 B/inode) = 16 inodes per block, i.e. 16 inodes per 128 MB, or 1 inode per 8 MB.

At 256 B/inode, that's 256 B per 8 MB, or 1 byte per 32 kB, or about 0.003 % of the total size, for the inodes.
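The arithmetic above can be sketched out directly (illustrative only; the constants are the defaults named in the answer):

```python
block_size = 4096        # bytes per block (ext4 default)
inode_size = 256         # bytes per on-disk inode (ext4 default)

# The one-block block bitmap has 8 * block_size bits, one bit per block,
# so a block group holds at most that many blocks.
blocks_per_group = 8 * block_size                  # 32768 blocks
group_bytes = blocks_per_group * block_size        # 128 MiB

# At least one block of the group is spent on the inode table.
inodes_per_block = block_size // inode_size        # 16 inodes

# Minimum inode density: 16 inodes per 128 MiB group.
bytes_per_inode = group_bytes // inodes_per_block  # 8 MiB per inode

# Fraction of the disk spent on those inodes.
overhead = inode_size / bytes_per_inode            # ~0.003 %

print(f"{group_bytes // 2**20} MiB groups, "
      f"1 inode per {bytes_per_inode // 2**20} MiB, "
      f"{overhead:.4%} overhead")
```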

Decreasing the number of inodes would not help; you'd just get a partially filled inode block. The size of an inode doesn't really matter either, since allocation is done by block. It's the block group size that's the real limit for the metadata.


Increasing the block size would help, and in theory the maximum block group size grows with the square of the block size (except that it seems to cap at a bit under 64k blocks/group). But you can't use a block size greater than the page size of the system, so on x86 you're stuck with 4 kB blocks.
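A quick sketch of that scaling: the one-block bitmap covers 8 × block_size blocks, so group size in bytes grows quadratically with block size, subject to the roughly 64k blocks/group cap noted above (the exact cap value here is an assumption for illustration):

```python
def max_group_bytes(block_size, blocks_cap=65536):
    """Largest block group, in bytes, for a given block size.

    A one-block bitmap tracks 8 * block_size blocks, capped at
    roughly 64k blocks per group.
    """
    blocks_per_group = min(8 * block_size, blocks_cap)
    return blocks_per_group * block_size

for bs in (1024, 2048, 4096):
    print(f"{bs} B blocks -> {max_group_bytes(bs) // 2**20} MiB groups")
```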


However, there's the bigalloc feature that's exactly what you want:

for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability.

The administrator can set a block cluster size at mkfs time (which is stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks. This means that block groups can be several gigabytes in size (instead of just 128MiB); however, the minimum allocation unit becomes a cluster, not a block, even for directories.
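The quoted passage can be put in numbers: with bigalloc each bitmap bit tracks a cluster rather than a block, so the group spans 8 × block_size clusters. The 64 kB cluster size below is just an example value for -C, not a default:

```python
block_size = 4096           # blocks are still 4 kB
cluster_size = 64 * 1024    # example -C value (an assumption, not a default)

# With bigalloc, each of the 8 * block_size bitmap bits tracks a cluster.
clusters_per_group = 8 * block_size                # 32768 clusters
group_bytes = clusters_per_group * cluster_size    # 2 GiB instead of 128 MiB

print(f"{group_bytes // 2**30} GiB per block group")
```

This is why the document says block groups can grow to "several gigabytes": the same one-block bitmap now covers cluster_size/block_size times more disk.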

You can enable that with mkfs.ext4 -O bigalloc, and set the cluster size with -C <bytes>, but mkfs does note that:

Warning: the bigalloc feature is still under development
See https://ext4.wiki.kernel.org/index.php/Bigalloc for more information

There are mentions of issues in combination with delayed allocation on that page and the ext4 man page, and the words "huge risk" also appear on the Bigalloc wiki page.


None of that has anything to do with the 64 MB/inode limit enforced by the -i option. That appears to be an arbitrary limit set at the interface level: the number of inodes can also be set directly with the -N option, and that path has no such check. Also, unlike the structural limits above, the upper bound is derived from the maximum block size the filesystem could support, not from the block size actually chosen.
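One plausible reading of where the exact number comes from, consistent with the error message quoted in the question (the derivation is an assumption about mkfs internals, but the arithmetic checks out):

```python
min_block_size = 1024     # smallest block size ext2/3/4 supports
max_block_size = 65536    # largest block size the on-disk format allows

# The reported minimum ratio matches the minimum block size...
assert min_block_size == 1024

# ...and the reported maximum matches max block size * 1024.
assert max_block_size * 1024 == 67108864
```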

Because of the 64k blocks/group limit, without bigalloc there's no way to get as few inodes as a ratio of 64 MB/inode would imply; with bigalloc, the number of inodes can be set far lower than that.
