Why is size reporting for directories different from that of other files?

files, filesystems, ls

I was wondering why an empty directory occupied 4096 bytes of space and I have seen this question. It is stated that space is allocated in blocks and hence, the size of a new directory is 4096 bytes.

However, I am pretty sure that allocation for "normal" files is done in blocks as well. At least it is like that in Windows filesystems, and I am guessing that it must be at least similar in ext*.

Now, as far as I understand, sizes for other types of files, such as regular files, symbolic links, etc., are reported as the real size. When I create an empty file, I see 0 as the size. When I type a few characters into it, I see <number of characters> bytes as the size, and so on.

So my question is: although allocation for other files is done in blocks too, why does the policy for reporting the size of a directory differ from that for a file?

Clarification

I thought the question was clear enough, but apparently it wasn't. I will try to clarify the question here.

1) What I think a directory is:

I will try to explain what I think a directory is by the following example. After reading, if it is wrong, please notify me.

Let's say that we have a directory named mydir. And let's say that it contains 3 files, which are: f0, f1 and f2. Let's assume that each file is 1 byte long.

Now, what is mydir? It is a pointer to an inode which contains the following: String "f0" and the inode number which f0 points to. String "f1" and the inode number which f1 points to. And string "f2" and the inode number which f2 points to. (At least this is what I think a directory is. Please correct me if I am wrong.)

Now there may be two methods for calculating the size of a directory:

1) Calculating the size of the inode which mydir points to.

2) Summing the sizes of the inodes which contents of mydir points to.

Although 1 is more counterintuitive, let's assume that it is the method being used. (For this question, it does not matter which method is actually used.) Then the size of mydir is calculated as follows:

2 + 2 + 2 + 3 * <space_required_to_store_an_inode_number>

2's are because each filename is 2 bytes long.

2) The question:

Now the question: assuming that what I think a directory is is correct, the reported size for mydir should be much, much less than 4096, no matter whether method 1 or method 2 is used to calculate its size.

Now, you will say that the reason it is reported as 4096 bytes is that allocation is done in blocks. Hence the reported size is that big.

But then I will say: Allocation is done in blocks for regular files as well. (See thrig's answer for reference) But nevertheless, their sizes are reported in real sizes. (1 byte if they contain 1 character, 2 bytes if they contain 2 characters etc.)

So my question is: why is the policy for reporting the sizes of directories so different from the policy for reporting the sizes of regular files?

More clarification:

We know that the initial number of blocks allocated for a non-empty file and for an empty directory is 8 blocks in both cases. (See thrig's answer.) So even though the same number of blocks is allocated for both regular files and directories, why is the reported size for a directory so much bigger?

Best Answer

I think the reason you're confused is that you don't know what a directory is. To clear this up, let's take a step back and examine how Unix filesystems work.

The Unix filesystem has several separate notions for addressing data on disk:

data blocks are groups of blocks on a disk which hold the contents of a file.
inodes are special blocks on a filesystem, each with a numerical address unique within that filesystem, which contain metadata about a file such as:
- permissions
- access / modification times
- size
- pointers to the data blocks (could be a list of blocks, extents, etc)
filenames are hierarchical locations on a filesystem root that are mapped to inodes.

In other words, a "file" is actually composed of three different things:

a PATH in the filesystem
an inode with metadata
data blocks pointed to by the inode

Most of the time, users imagine a file to be synonymous to "the entity associated with the filename" - it's only when you're dealing with low-level entities or the file/socket API that you think of inodes or data blocks. Directories are one of those low-level entities.

You might think that a directory is a file that contains a bunch of other files. That's only half-correct. A directory is a file that maps filenames to inode numbers. It doesn't "contain" files, only filenames that point to inodes. Think of it like a text file that contains entries like this:

. - inode 1234
.. - inode 200
Documents - inode 2008
README.txt - inode 2009

The entries above are called directory entries. They are basically mappings from filenames to inode numbers. A directory is a special file that contains directory entries.

That's a simplification of course, but it explains the basic idea and other directory weirdness.
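The mapping can be observed from the shell. The following is a minimal sketch, assuming GNU coreutils (for `stat -c`) and a scratch directory created with mktemp; the names Documents and README.txt are only illustrative, and the inode numbers printed will differ on every system.

```shell
# Sketch: show that a directory is a list of (name, inode number) pairs.
# Assumes GNU coreutils; the scratch directory and names are illustrative.
demo=$(mktemp -d)
mkdir "$demo/Documents"
: > "$demo/README.txt"

# -a includes the "." and ".." entries, -i prints each entry's inode number.
ls -ai "$demo"

# The "." entry carries the directory's own inode number.
dot_inode=$(stat -c %i "$demo/.")
dir_inode=$(stat -c %i "$demo")
echo "inode of '.': $dot_inode, inode of the directory: $dir_inode"
```

The scratch directory can be removed afterwards with rm -rf "$demo".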

Why don't directories know their own size?
- Because they only contain pointers to other stuff, you have to iterate over their contents to find the size
Why aren't directories ever empty?
- Because they contain at least the . and .. entries. Thus, a proper directory will be at least as large as the smallest filesize that can contain those entries. In most filesystems, 4096 bytes is the smallest.
Why is it that you need write permission on the parent directory when renaming a file?
- Because you're not just changing the file, you're changing the directory entry pointing to the file.
Why does ls show a weird number of "links" to a directory?
- a directory can be referenced (linked to) by itself, its parent, its children.
What does a hard link do and how does it differ from a symlink?
- a hard link adds a directory entry pointing to the same inode number. Because it points to an inode number, it can only point to files in the same filesystem (inodes are local to a filesystem)
- a symlink adds a new inode which points to a separate filename. Because it refers to a filename it can point to arbitrary files in the tree.
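The difference between the two kinds of link can be checked by comparing inode numbers. This is a small sketch rather than a definitive test: it assumes GNU coreutils (`stat -c`, which does not follow symlinks unless given -L), and the file names original, hard and soft are made up for the example.

```shell
# Sketch: a hard link is a second directory entry for the same inode,
# while a symlink is a new inode that stores a path.
# Assumes GNU coreutils (stat -c); all names are illustrative.
links=$(mktemp -d)
echo hello > "$links/original"
ln "$links/original" "$links/hard"      # new name, same inode
ln -s original "$links/soft"            # new inode holding the path "original"

orig_inode=$(stat -c %i "$links/original")
hard_inode=$(stat -c %i "$links/hard")
soft_inode=$(stat -c %i "$links/soft")  # the symlink's own inode, not its target's
link_count=$(stat -c %h "$links/original")
soft_target=$(readlink "$links/soft")

echo "original: inode $orig_inode, link count $link_count"
echo "hard:     inode $hard_inode"
echo "soft:     inode $soft_inode -> $soft_target"
```

The hard link shares the inode of the original and raises its link count, while the symlink gets an inode of its own whose content is just the target path.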

But wait! Weird things are happening!

ls -ld somedirectory always shows the filesize to be 4096, whereas ls -l somefile shows the actual size of a file. Why?

Point of confusion 1: when we say "size" we can be referring to two things:

filesize, which is a number stored in the inode; and
allocated size, which is the number of blocks associated with the inode times the size of each block.

In general, these are not the same number. Try running stat on a regular file and you'll see this difference.
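As a rough illustration of the two numbers, the sketch below reads both from the inode of a five-byte file. It assumes GNU stat, where %s is the filesize in bytes, %b the number of allocated blocks and %B the size in bytes of each block counted by %b; the allocated figure depends on the filesystem, so no particular value is claimed here.

```shell
# Sketch: compare the filesize stored in the inode with the allocated size.
# Assumes GNU stat: %s = size in bytes, %b = blocks allocated,
# %B = bytes per block as counted by %b.
work=$(mktemp -d)
printf 'hello' > "$work/small"     # five bytes of content

size=$(stat -c %s "$work/small")
blocks=$(stat -c %b "$work/small")
unit=$(stat -c %B "$work/small")
allocated=$((blocks * unit))

echo "filesize:  $size bytes"
echo "allocated: $allocated bytes ($blocks blocks of $unit bytes)"
```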

When a filesystem creates a non-empty file, it usually eagerly allocates data blocks in groups. This is because files have a tendency to grow and shrink arbitrarily fast. If the filesystem only allocated as many data blocks as needed to represent the file, growing / shrinking would be slower, and fragmentation would be a serious concern. So in practice, filesystems don't have to keep reallocating space for small changes. This means that there may be a lot of space on disk that is "claimed" by files but completely unused.

What does the filesystem do with all this unused space? Nothing, until it feels like it needs to. If your filesystem optimizer tool (maybe an online optimizer running in the background, maybe part of your fsck, maybe built into your filesystem itself) feels like it, it may reassign the data blocks of your files: moving used blocks, freeing unused blocks, etc.

So now we come to the difference between regular files and directories: because directories form the "backbone" of your filesystem, you expect that they may need to be accessed or modified frequently and should thus be optimized. And so you don't want them fragmented at all. When directories are created, they always max out all their data blocks in size, even when they only have so many directory entries. This is okay for directories, because, unlike files, directories are typically limited in size and growth rate.

The 4096 reported size of directories is the "filesize" number stored in the directory inode, not the number of entries in the directory. It isn't a fixed number - it's the maximum bytes that will fit into the allocated number of blocks for the directory. Typically, this is 512 bytes/block times 8 blocks allocated for a file with any contents - incidentally, for directories, the filesize and the allocated size are the same. Because it's allocated as a single group, the filesystem optimizer won't move its blocks around.

As the directory grows, more data blocks are assigned to it, and it will also max out those blocks by adjusting the filesize accordingly.

And so ls and stat will show the filesize field of the directory's inode, which is set to the size of the data blocks assigned to it.
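The same fields can be read for a directory. The sketch below, again assuming GNU stat, prints the filesize of a freshly created directory and then of the same directory after 500 entries have been added. The exact numbers depend on the filesystem (the 4096 figure above is what ext-style filesystems with 4 kB blocks typically report); the directory name grow and the long file names are purely illustrative.

```shell
# Sketch: read the filesize field of a directory inode before and after
# the directory grows. Assumes GNU stat; names are illustrative.
d=$(mktemp -d)
mkdir "$d/grow"

before=$(stat -c %s "$d/grow")
alloc_before=$(( $(stat -c %b "$d/grow") * $(stat -c %B "$d/grow") ))

# Add enough directory entries that one block is unlikely to hold them all.
i=0
while [ "$i" -lt 500 ]; do
  : > "$d/grow/a_fairly_long_illustrative_file_name_$i"
  i=$((i + 1))
done

after=$(stat -c %s "$d/grow")

echo "new directory:     filesize $before bytes, allocated $alloc_before bytes"
echo "after 500 entries: filesize $after bytes"
```

On a filesystem that behaves as described above, both reported sizes are whole multiples of the block size rather than the number of bytes the entries actually need.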

Related Solutions

Sparse files/file holes and unexpected block size

Ext4 can use 1kB, 2kB or 4kB as the block size; as far as I know the default on Ubuntu is 4kB. Note that here, a block is the size of a file chunk, which is constant for a given filesystem. The file you describe has two blocks that are not all zeroes: the one containing hello (surrounded by a bunch of zeroes: 3616 before and 474 after), and the one containing there (preceded by a bunch of zeroes, and containing only 3148 bytes, after which the end of the file is reached). The total is two blocks of 4kB.

In the ls output, blocks are an arbitrary unit chosen by the ls command and defaulting to 1kB. There are 2 blocks of 4kB each allocated to contain file data, therefore the allocated size for the file is 8kB.

Your confusion may be due to two things. First, the figure of 2048 bytes for a block is possible, but it's not the default value under Ubuntu (or most modern distributions), and it's apparently not the value on your system. You can check the block size by running tune2fs -l /dev/sdz42 (use the actual path to your filesystem device).

Second, sparse files work by not storing blocks that consist entirely of zeroes. If a block (which is of necessity aligned on a block size boundary, at least for most filesystems including ext4) contains zeroes and other things, then the full block is stored on the disk. Thus, in that 40012-byte file (how did you get to 40013, by the way?), there are 4 all-zero non-stored blocks, then one stored block containing hello surrounded by zeroes, then 4 more all-zero non-stored blocks, and a final partial block containing zeroes and there.
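The layout described above can be reproduced with a short sketch. It assumes GNU dd and GNU stat, and it places the two lines at byte offsets 20000 and 40006, which are inferred from the zero counts quoted above rather than taken from the original utility. The allocated size depends on the filesystem and its block size, so it is printed but not assumed.

```shell
# Sketch: build a 40012-byte sparse file with two short lines in it and
# compare its apparent size with its allocated size.
# Assumes GNU dd and GNU stat; the offsets are inferred, not authoritative.
f=$(mktemp)

# conv=notrunc keeps existing data; seek=N skips N output bytes (bs=1).
printf 'hello\n' | dd of="$f" bs=1 seek=20000 conv=notrunc 2>/dev/null
printf 'there\n' | dd of="$f" bs=1 seek=40006 conv=notrunc 2>/dev/null

apparent=$(stat -c %s "$f")
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))

echo "apparent size:  $apparent bytes"
echo "allocated size: $allocated bytes"
```

On a filesystem with 4kB blocks that supports holes, the allocated size would be expected to match the two stored blocks discussed above.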

Note that your utility can be written in terms of standard shell commands:

n=20000
while IFS= read -r line; do
  dd bs=1 seek=$n </dev/null
  echo "$line"
done >testfile

File block size – difference between stat and ls

Many disks have a sector size of 512 bytes, meaning that any read or write on the disk transfers a whole 512-byte sector at a time. It is quite natural to design filesystems where a sector is not split between files (that would complicate the design and hurt performance); therefore filesystems tend to use 512-byte chunks for files. Hence traditional utilities such as ls and du indicate sizes in units of 512-byte chunks.

For humans, 512-byte units are not very meaningful. 1kB is the same order of magnitude and a lot more meaningful. A filesystem block (the smallest unit that a file is divided in) actually often consists of several sectors: 1kB, 2kB and 4kB are common filesystem block sizes; so the 512-byte unit is not strongly justified by the filesystem design, and there is no good reason other than tradition to use a 512-byte unit outside a disk driver at all.

So you have a tradition that doesn't have a lot going for it, and a more readable convention that's catching on. A bit like octal and hexadecimal: there isn't one that's right and one that's wrong, they're different ways of writing the same numbers.

Many tools have an option to select display units: ls --block-size=512 for GNU ls, setting POSIXLY_CORRECT=1 in the environment for GNU df and GNU du to get 512-byte units (or passing -k to force 1kB units). What the stat command in GNU coreutils exposes as the “block size” (the %B value) is an OS-dependent value of an internal interface; depending on the OS, it may or may not be related to a size used by the filesystem or disk code (it usually isn't; see Difference between block size and cluster size). On Linux, the value is 512, regardless of what any underlying driver is doing. The value of %B never matters, it's just a quirk that it exists at all.
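To make the unit difference concrete, here is a small sketch. The helper blocks_to_kib is a hypothetical name introduced only for this example; it converts a count of 512-byte units into 1kB units, rounding up. The three commands after it use only the options mentioned above and assume GNU ls and GNU du; their output depends on the filesystem.

```shell
# Sketch: the same allocation expressed in 512-byte units and in 1kB units.
# blocks_to_kib is a hypothetical helper, not part of any standard tool.
blocks_to_kib() {
  # $1 = number of 512-byte units; round up to whole 1kB units
  echo $(( ($1 * 512 + 1023) / 1024 ))
}

blocks_to_kib 8        # 8 units of 512 bytes are 4 units of 1kB

sample=$(mktemp)
printf 'hello\n' > "$sample"

ls -s --block-size=512 "$sample"     # allocated size in 512-byte units
POSIXLY_CORRECT=1 du "$sample"       # GNU du in 512-byte units
du -k "$sample"                      # the same allocation in 1kB units
```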
