I think the reason you're confused is that you don't know what a directory actually is. To clear that up, let's take a step back and examine how Unix filesystems work.
The Unix filesystem has several separate notions for addressing data on disk:
- data blocks are groups of blocks on a disk that hold the contents of a file.
- inodes are special blocks on a filesystem, each with a numerical address unique within that filesystem, which contain metadata about a file such as:
- permissions
- access / modification times
- size
- pointers to the data blocks (could be a list of blocks, extents, etc)
- filenames are hierarchical locations under a filesystem root that are mapped to inodes.
In other words, a "file" is actually composed of three different things:
- a PATH in the filesystem
- an inode with metadata
- data blocks pointed to by the inode
Most of the time, users imagine a file to be synonymous with "the entity associated with the filename" - it's only when you're dealing with low-level entities or the file/socket API that you think of inodes or data blocks. Directories are one of those low-level entities.
You might think that a directory is a file that contains a bunch of other files. That's only half-correct. A directory is a file that maps filenames to inode numbers. It doesn't "contain" files; it contains filenames, each paired with a pointer to an inode. Think of it like a text file that contains entries like this:
- . - inode 1234
- .. - inode 200
- Documents - inode 2008
- README.txt - inode 2009
The entries above are called directory entries. They are basically mappings from filenames to inode numbers. A directory is a special file that contains directory entries.
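You can see these mappings directly: the -i flag of ls prints the inode number next to each entry, and -a includes the . and .. entries. A quick sketch (the directory name is made up; the inode numbers you see will differ):

```shell
# create a fresh directory and list its entries with inode numbers
mkdir demo
ls -ia demo    # -i prints inode numbers, -a includes the . and .. entries
# even a brand-new "empty" directory already contains the . and .. entries
```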
That's a simplification of course, but it explains the basic idea and other directory weirdness.
- Why don't directories know their own size?
- Because they contain only pointers to other stuff; to find the total size of the contents, you have to iterate over the entries.
- Why aren't directories ever empty?
- Because they contain at least the . and .. entries. Thus, a proper directory will be at least as large as the smallest allocation that can hold those entries - in most filesystems, 4096 bytes.
- Why is it that you need write permission on the parent directory when renaming a file?
- Because you're not just changing the file, you're changing the directory entry pointing to the file.
- Why does ls show a weird number of "links" to a directory?
- a directory can be referenced (linked to) by itself (its own . entry), by its parent (the name in the parent directory), and by each of its subdirectories (their .. entries).
- What does a hard link do and how does it differ from a symlink?
- a hard link adds a directory entry pointing to the same inode number. Because it points to an inode number, it can only point to files in the same filesystem (inodes are local to a filesystem)
- a symlink adds a new inode which points to a separate filename. Because it refers to a filename it can point to arbitrary files in the tree.
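The difference is easy to observe with ls -i, which prints inode numbers (a sketch; the filenames are made up):

```shell
cd "$(mktemp -d)"                  # work in a scratch directory
echo hello > original.txt
ln original.txt hardlink.txt       # hard link: a second directory entry for the same inode
ln -s original.txt symlink.txt     # symlink: a new inode whose data is the target's name
ls -li                             # original.txt and hardlink.txt share one inode number;
                                   # symlink.txt has its own, and its size is the length
                                   # of the string "original.txt"
```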
But wait! Weird things are happening! ls -ld somedirectory always shows the filesize to be 4096, whereas ls -l somefile shows the actual size of the file. Why?
Point of confusion 1: when we say "size" we can be referring to two things:
- filesize, which is a number stored in the inode; and
- allocated size, which is the number of blocks associated with the inode times the size of each block.
In general, these are not the same number. Try running stat on a regular file and you'll see this difference.
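For instance (GNU coreutils stat assumed; the exact block count depends on your filesystem):

```shell
printf 'hello\n' > small.txt
stat small.txt
# GNU stat reports both numbers, e.g. "Size: 6" (the filesize stored in
# the inode) and "Blocks: 8" (allocated 512-byte units, i.e. 4096 bytes)
stat -c 'filesize=%s bytes, allocated=%b blocks of %B bytes' small.txt
```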
When a filesystem creates a non-empty file, it usually eagerly allocates data blocks in groups. This is because files have a tendency to grow and shrink arbitrarily fast. If the filesystem only allocated as many data blocks as needed to represent the file, growing / shrinking would be slower, and fragmentation would be a serious concern. So in practice, filesystems don't have to keep reallocating space for small changes. This means that there may be a lot of space on disk that is "claimed" by files but completely unused.
What does the filesystem do with all this unused space? Nothing. Until it feels like it needs to. If your filesystem optimizer tool - maybe an online optimizer running in the background, maybe part of your fsck, maybe built-in to your filesystem itself - feels like it, it may reassign the data blocks of your files - moving used blocks, freeing unused blocks, etc.
So now we come to the difference between regular files and directories: because directories form the "backbone" of your filesystem, you expect them to be accessed and modified frequently, so they should be optimized - and you don't want them fragmented at all. When a directory is created, it maxes out all of its data blocks in size, even when it holds only a few directory entries. This is okay for directories because, unlike regular files, they are typically limited in size and growth rate.
The 4096 reported as the size of a directory is the "filesize" number stored in the directory's inode, not the number of entries in it. It isn't a fixed number - it's the maximum number of bytes that fit into the blocks allocated to the directory. Typically that's 8 blocks of 512 bytes each, the same allocation any non-empty file gets - and, incidentally, for directories the filesize and the allocated size are the same. Because the blocks are allocated as a single group, the filesystem optimizer won't move them around.
As the directory grows, more data blocks are assigned to it, and it will also max out those blocks by adjusting the filesize accordingly.
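You can watch this happen (ext4-style behaviour assumed; the exact numbers depend on your filesystem):

```shell
cd "$(mktemp -d)"
mkdir bigdir
stat -c %s bigdir      # typically 4096: one block group, already "maxed out"
# add enough entries that the directory needs more data blocks
for i in $(seq 1 1000); do touch "bigdir/some_reasonably_long_filename_$i"; done
stat -c %s bigdir      # now larger, and still a multiple of the block size
```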
And so ls and stat will show the filesize field of the directory's inode, which is set to the size of the data blocks assigned to it.