Why is 67108864 the maximum bytes-per-inode ratio? Why is there a max

Tags: ext4, hard-disk, inode, mkfs

Formatting a disk for purely large video files, I calculated what I thought was an appropriate bytes-per-inode value, in order to maximise usable disk space.

I was greeted, however, with:

mkfs.ext4: invalid inode ratio [RATIO] (min 1024/max 67108864)

I assume the minimum is derived from what could even theoretically be used – no point having more inodes than could ever be utilised.

But where does the maximum come from? mkfs doesn't know the size of files I'll put on the filesystem it creates – so unless it was to be {disk size} - {1 inode size} I don't understand why we have a maximum at all, much less one as low as 67MB.

Best Answer

Because of the way the filesystem is built. It's a bit messy, and by default you can't even get the ratio as low as one inode per 64 MB.

From the Ext4 Disk Layout document on kernel.org, we see that the file system internals are tied to the block size (4 kB by default), which controls both the size of a block group, and the amount of inodes in a block group. A block group has a one-block sized bitmap of the blocks in the group, and a minimum of one block of inodes.

Because the bitmap must fit in a single block, the maximum block group size is 8 × (block size in bytes) blocks, so on an FS with 4 kB blocks, a block group is 32768 blocks, or 128 MB. The inode table takes at least one block, so with 4 kB blocks you get at least (4096 B/block) / (256 B/inode) = 16 inodes per block, i.e. 16 inodes per 128 MB, or 1 inode per 8 MB.

At 256 B/inode, that's 256 B per 8 MB, or 1 byte per 32 kB, or about 0.003 % of the total size, for the inodes.
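The arithmetic above can be sketched out directly (illustrative only; the constants are the defaults named in the answer):

```python
block_size = 4096        # bytes per block (ext4 default)
inode_size = 256         # bytes per on-disk inode (ext4 default)

# The one-block block bitmap has 8 * block_size bits, one bit per block,
# so a block group holds at most that many blocks.
blocks_per_group = 8 * block_size                  # 32768 blocks
group_bytes = blocks_per_group * block_size        # 128 MiB

# At least one block of the group is spent on the inode table.
inodes_per_block = block_size // inode_size        # 16 inodes

# Minimum inode density: 16 inodes per 128 MiB group.
bytes_per_inode = group_bytes // inodes_per_block  # 8 MiB per inode

# Fraction of the disk spent on those inodes.
overhead = inode_size / bytes_per_inode            # ~0.003 %

print(f"{group_bytes // 2**20} MiB groups, "
      f"1 inode per {bytes_per_inode // 2**20} MiB, "
      f"{overhead:.4%} overhead")
```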

Decreasing the number of inodes would not help; you'd just get a partially filled inode block. The size of an inode doesn't really matter either, since allocation is done by block. It's the block group size that's the real limit for the metadata.


Increasing the block size would help, and in theory the maximum block group size grows with the square of the block size (except that it seems to cap at a bit under 64k blocks/group). But you can't use a block size greater than the page size of the system, so on x86 you're stuck with 4 kB blocks.
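A quick sketch of that scaling: the one-block bitmap covers 8 × block_size blocks, so group size in bytes grows quadratically with block size, subject to the roughly 64k blocks/group cap noted above (the exact cap value here is an assumption for illustration):

```python
def max_group_bytes(block_size, blocks_cap=65536):
    """Largest block group, in bytes, for a given block size.

    A one-block bitmap tracks 8 * block_size blocks, capped at
    roughly 64k blocks per group.
    """
    blocks_per_group = min(8 * block_size, blocks_cap)
    return blocks_per_group * block_size

for bs in (1024, 2048, 4096):
    print(f"{bs} B blocks -> {max_group_bytes(bs) // 2**20} MiB groups")
```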


However, there's the bigalloc feature that's exactly what you want:

for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability.

The administrator can set a block cluster size at mkfs time (which is stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks. This means that block groups can be several gigabytes in size (instead of just 128MiB); however, the minimum allocation unit becomes a cluster, not a block, even for directories.
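The quoted passage can be put in numbers: with bigalloc each bitmap bit tracks a cluster rather than a block, so the group spans 8 × block_size clusters. The 64 kB cluster size below is just an example value for -C, not a default:

```python
block_size = 4096           # blocks are still 4 kB
cluster_size = 64 * 1024    # example -C value (an assumption, not a default)

# With bigalloc, each of the 8 * block_size bitmap bits tracks a cluster.
clusters_per_group = 8 * block_size                # 32768 clusters
group_bytes = clusters_per_group * cluster_size    # 2 GiB instead of 128 MiB

print(f"{group_bytes // 2**30} GiB per block group")
```

This is why the document says block groups can grow to "several gigabytes": the same one-block bitmap now covers cluster_size/block_size times more disk.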

You can enable that with mkfs.ext4 -O bigalloc, and set the cluster size with -C <bytes>, but mkfs does note that:

Warning: the bigalloc feature is still under development
See https://ext4.wiki.kernel.org/index.php/Bigalloc for more information

There are mentions of issues in combination with delayed allocation on that page and the ext4 man page, and the words "huge risk" also appear on the Bigalloc wiki page.


None of that has anything to do with the 64 MB/inode limit enforced by the -i option. That appears to be an arbitrary limit set at the interface level: the number of inodes can also be set directly with the -N option, and that path has no such check. Also, unlike the structural limits above, the upper bound is derived from the maximum block size the filesystem could support, not from the block size actually chosen.
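One plausible reading of where the exact number comes from, consistent with the error message quoted in the question (the derivation is an assumption about mkfs internals, but the arithmetic checks out):

```python
min_block_size = 1024     # smallest block size ext2/3/4 supports
max_block_size = 65536    # largest block size the on-disk format allows

# The reported minimum ratio matches the minimum block size...
assert min_block_size == 1024

# ...and the reported maximum matches max block size * 1024.
assert max_block_size * 1024 == 67108864
```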

Because of the 64k blocks/group limit, without bigalloc there's no way to get as few inodes as a ratio of 64 MB/inode would imply; with bigalloc, the number of inodes can be set far lower than that.
