Hard links to directories aren't fundamentally different from hard links to files. In fact, many filesystems do have hard links to directories, but only in a very disciplined way.
In a filesystem that doesn't allow users to create hard links to directories, a directory's links are exactly:

- the `.` entry in the directory itself;
- the `..` entries in all the directories that have this directory as their parent;
- one entry in the directory that `..` points to.
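A consequence worth noting: on such a filesystem, a directory's link count is exactly 2 plus its number of immediate subdirectories (one link from its entry in the parent, one from its own `.`, and one `..` per subdirectory). Here's a minimal sketch to check this, assuming a conventional filesystem such as ext4 (btrfs, notably, always reports a link count of 1 for directories):

```python
import os

def expected_nlink(path):
    # One link from the entry in the parent, one from this directory's
    # own ".", and one ".." entry per immediate subdirectory.
    subdirs = sum(1 for e in os.scandir(path)
                  if e.is_dir(follow_symlinks=False))
    return 2 + subdirs

path = "/usr"  # any directory you can read
print(os.stat(path).st_nlink, expected_nlink(path))  # the two should match
```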
An additional constraint in such filesystems is that from any directory, following `..` entries must eventually lead to the root. This ensures that the filesystem is presented as a single tree. This constraint can be violated on filesystems that allow hard links to directories.
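You can observe this constraint by repeatedly resolving `..` until it stops changing; on a well-formed tree the walk always terminates at the root, where `..` points back to the root itself. A small sketch (the cycle check only matters on a filesystem where the single-tree constraint can break):

```python
import os.path

d = os.path.realpath(".")
seen = set()
while d not in seen:  # a revisit would mean ".." forms a loop
    seen.add(d)
    parent = os.path.realpath(os.path.join(d, ".."))
    if parent == d:   # at the root, ".." points to the root itself
        print("reached the root:", d)
        break
    d = parent
```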
Filesystems that allow hard links to directories permit more cases than the three above. However, they maintain the invariant that these three cases always exist: a directory's `.` always exists and points to itself, and a directory's `..` always points to a directory that has it as an entry. Unlinking a directory entry that is a directory only removes it if it contains no entries other than `.` and `..`.
Thus a dangling `..` cannot happen. What can go wrong is that a part of the filesystem can become detached, with a directory's `..` pointing to one of its own descendants, so that following `../../../..` eventually forms a loop. (As seen above, filesystems that don't allow hard links to directories prevent this.) If all the paths from the root to such a directory are unlinked, the part of the filesystem containing it can no longer be reached, unless some processes still have their current directory inside it. That part can't even be deleted, since there's no way to get at it.
GCFS allows directory hard links and runs a garbage collector to delete such detached parts of the filesystem. You should read its specification, which addresses your concerns in detail. It's an interesting intellectual exercise, but I don't know of any filesystem used in practice that provides garbage collection.
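To make the garbage-collection idea concrete, here is a toy mark-and-sweep pass over an in-memory model of a directory graph. This only illustrates the general reachability technique; it is not GCFS's actual algorithm, and the graph below is made up:

```python
# Each key is a directory; the value is the set of subdirectories it
# contains. "loop1" and "loop2" form a detached cycle kept alive only
# by each other's entries.
graph = {
    "root": {"a"},
    "a": set(),
    "loop1": {"loop2"},
    "loop2": {"loop1"},
}

def mark(root):
    """Collect every directory reachable from the root."""
    live, stack = set(), [root]
    while stack:
        d = stack.pop()
        if d not in live:
            live.add(d)
            stack.extend(graph[d])
    return live

def sweep(live):
    """Delete every directory the mark phase did not reach."""
    for d in list(graph):
        if d not in live:
            del graph[d]

sweep(mark("root"))
print(sorted(graph))  # ['a', 'root'] -- the detached cycle is gone
```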
Because of the way the filesystem is built. It's a bit messy, and by default you can't even get the ratio as low as one inode per 64 MB.
From the Ext4 Disk Layout document on kernel.org, we see that the filesystem internals are tied to the block size (4 kB by default), which controls both the size of a block group and the number of inodes in a block group. A block group has a bitmap of its blocks that is itself one block in size, and a minimum of one block of inodes.
Because of the bitmap, the maximum block group size is 8 * (block size in bytes) blocks, since a one-block bitmap holds one bit per block. So on an FS with 4 kB blocks, the block groups are 32768 blocks, or 128 MB, in size. The inodes take one block at minimum, so for 4 kB blocks you get at least (4096 B/block) / (256 B/inode) = 16 inodes per block, i.e. 16 inodes per 128 MB, or 1 inode per 8 MB.
At 256 B/inode, that's 256 B per 8 MB, or 1 byte per 32 kB, or about 0.003 % of the total size, for the inodes.
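The arithmetic is easy to reproduce; a quick sketch using ext4's defaults of 4 kB blocks and 256 B inodes:

```python
block_size = 4096   # bytes, ext4 default
inode_size = 256    # bytes, ext4 default

group_blocks = 8 * block_size            # one-block bitmap, one bit per block
group_bytes = group_blocks * block_size  # 32768 blocks = 128 MiB
min_inodes = block_size // inode_size    # one inode block = 16 inodes

print(group_bytes // 2**20, "MiB per block group")          # 128
print(group_bytes // min_inodes // 2**20, "MiB per inode")  # 8
print(f"{100 * min_inodes * inode_size / group_bytes:.4f} % overhead")  # ~0.003
```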
Decreasing the number of inodes would not help; you'd just get a partially filled inode block. The size of an inode doesn't really matter either, since allocation is done by whole blocks. It's the block group size that's the real limit for the metadata.
Increasing the block size would help, and in theory the maximum block group size grows with the square of the block size (8 * block size blocks, each of block size bytes), except that it seems to cap at a bit less than 64k blocks/group. But you can't use a block size greater than the page size of the system, so on x86 you're stuck with 4 kB blocks.
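The quadratic growth (and the cap) can be tabulated from the same formula; here using 65536 as an approximation of the blocks-per-group cap:

```python
for block_size in (1024, 2048, 4096, 65536):     # bytes
    group_blocks = min(8 * block_size, 65536)    # approximate cap
    group_bytes = group_blocks * block_size
    print(f"{block_size:>6} B blocks -> {group_bytes // 2**20:>5} MiB groups")
```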
However, there's the `bigalloc` feature, which is exactly what you want:
> for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability.
>
> The administrator can set a block cluster size at mkfs time (which is stored in the `s_log_cluster_size` field in the superblock); from then on, the block bitmaps track clusters, not individual blocks. This means that block groups can be several gigabytes in size (instead of just 128MiB); however, the minimum allocation unit becomes a cluster, not a block, even for directories.
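To put numbers on "several gigabytes": with the bitmap now tracking clusters, the same one-block-bitmap arithmetic gives a group size that grows linearly with the cluster size. A sketch, assuming 4 kB blocks and a bitmap that still occupies a single block:

```python
block_size = 4096                              # bytes
for cluster_size in (4096, 65536, 1048576):    # possible -C values, in bytes
    group_clusters = 8 * block_size            # one-block bitmap, 1 bit/cluster
    group_bytes = group_clusters * cluster_size
    print(f"{cluster_size // 1024:>5} KiB clusters -> "
          f"{group_bytes // 2**20:>6} MiB groups")
```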
You can enable that with `mkfs.ext4 -Obigalloc`, and set the cluster size with `-C<bytes>`, but `mkfs` does note that:
```
Warning: the bigalloc feature is still under development
See https://ext4.wiki.kernel.org/index.php/Bigalloc for more information
```
There are mentions of issues in combination with delayed allocation on that page and in the `ext4` man page, and the words "huge risk" also appear on the Bigalloc wiki page.
None of that has anything to do with the 64 MB/inode limit set by the `-i` option. That appears to be just an arbitrary limit set at the interface level. The number of inodes can also be set directly with the `-N` option, and when that's used, there are no checks. Also, the `-i` upper limit is based on the maximum block size of the filesystem, not on the block size actually chosen, as the structural limits are.
Because of the 64k blocks/group limit, without `bigalloc` there's no way to get as few inodes as a ratio of 64 MB/inode would imply; with `bigalloc`, the number of inodes can be set much lower than that.
Say you did make the inode table a file; then the next question is... where do you store information about that file? You'd thus need "real" inodes and "extended" inodes, like an MS-DOS partition table. Granted, you'd only need one (or maybe a few, e.g., to also have your journal be a file). But you'd still have special cases and different code paths. Any corruption to that file would be disastrous, too. And consider that, before journaling, it was common for files that were being written when the power went out to be heavily damaged. Your file operations would have to be a lot more robust against power failure, crashes, etc. than they were on, e.g., ext2.
Traditional Unix filesystems found a simpler (and more robust) solution: put an inode block (or group of blocks) every X blocks. Then you find them by simple arithmetic. Of course, then it's not possible to add more (without restructuring the entire filesystem). And even if you lose/corrupt the inode block you were writing to when the power failed, that's only losing a few inodes — far better than a substantial portion of the filesystem.
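That "simple arithmetic" looks roughly like this sketch, in the style of ext2; the layout numbers and the `Superblock` fields are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Superblock:             # only the fields this sketch needs
    block_size: int = 4096
    inode_size: int = 256
    inodes_per_group: int = 16
    inode_table_start: int = 2    # block of each group's inode table
    blocks_per_group: int = 32768

def locate_inode(n, sb):
    """Map inode number n (numbering starts at 1) to a (block, byte
    offset) pair by pure arithmetic, ext2-style."""
    index = n - 1
    group, slot = divmod(index, sb.inodes_per_group)
    table = group * sb.blocks_per_group + sb.inode_table_start
    byte_off = slot * sb.inode_size
    return table + byte_off // sb.block_size, byte_off % sb.block_size

print(locate_inode(17, Superblock()))  # (32770, 0): second group's table
```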
More modern designs use B-tree variants, which let inode metadata grow dynamically; filesystems like btrfs, XFS, and ZFS do not suffer from fixed inode limits.