I think the reason you're confused is that you don't know what a directory actually is. To clear that up, let's take a step back and examine how Unix filesystems work.
The Unix filesystem has several separate notions for addressing data on disk:
- data blocks are groups of blocks on a disk that hold the contents of a file.
- inodes are special blocks on a filesystem, each with a numerical address unique within that filesystem, which contain metadata about a file such as:
- permissions
- access / modification times
- size
- pointers to the data blocks (could be a list of blocks, extents, etc)
- filenames are hierarchical locations under a filesystem root that are mapped to inodes.
In other words, a "file" is actually composed of three different things:
- a PATH in the filesystem
- an inode with metadata
- data blocks pointed to by the inode
Most of the time, users imagine a file to be synonymous with "the entity associated with the filename" - it's only when you're dealing with low-level entities or the file/socket API that you think of inodes or data blocks. Directories are one of those low-level entities.
You might think that a directory is a file that contains a bunch of other files. That's only half-correct. A directory is a file that maps filenames to inode numbers. It doesn't "contain" files; it contains filenames, each paired with a pointer to an inode. Think of it like a text file that contains entries like this:
- . - inode 1234
- .. - inode 200
- Documents - inode 2008
- README.txt - inode 2009
The entries above are called directory entries. They are basically mappings from filenames to inode numbers. A directory is a special file that contains directory entries.
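You can see these mappings directly: the -i flag of ls prints the inode number next to each entry, and -a includes the . and .. entries. A quick sketch (the directory name is made up; the inode numbers you see will differ):

```shell
# create a fresh directory and list its entries with inode numbers
mkdir demo
ls -ia demo    # -i prints inode numbers, -a includes the . and .. entries
# even a brand-new "empty" directory already contains the . and .. entries
```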
That's a simplification of course, but it explains the basic idea and other directory weirdness.
- Why don't directories know their own size?
- Because they contain only pointers to other stuff; to find the total size of the contents, you have to iterate over the entries.
- Why aren't directories ever empty?
- Because they contain at least the . and .. entries. Thus, a proper directory will be at least as large as the smallest allocation that can hold those entries - in most filesystems, 4096 bytes.
- Why is it that you need write permission on the parent directory when renaming a file?
- Because you're not just changing the file, you're changing the directory entry pointing to the file.
- Why does ls show a weird number of "links" to a directory?
- a directory can be referenced (linked to) by itself (its own . entry), by its parent (the name in the parent directory), and by each of its subdirectories (their .. entries).
- What does a hard link do and how does it differ from a symlink?
- a hard link adds a directory entry pointing to the same inode number. Because it points to an inode number, it can only point to files in the same filesystem (inodes are local to a filesystem)
- a symlink adds a new inode which points to a separate filename. Because it refers to a filename it can point to arbitrary files in the tree.
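The difference is easy to observe with ls -i, which prints inode numbers (a sketch; the filenames are made up):

```shell
cd "$(mktemp -d)"                  # work in a scratch directory
echo hello > original.txt
ln original.txt hardlink.txt       # hard link: a second directory entry for the same inode
ln -s original.txt symlink.txt     # symlink: a new inode whose data is the target's name
ls -li                             # original.txt and hardlink.txt share one inode number;
                                   # symlink.txt has its own, and its size is the length
                                   # of the string "original.txt"
```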
But wait! Weird things are happening! ls -ld somedirectory always shows the filesize to be 4096, whereas ls -l somefile shows the actual size of the file. Why?
Point of confusion 1: when we say "size" we can be referring to two things:
- filesize, which is a number stored in the inode; and
- allocated size, which is the number of blocks associated with the inode times the size of each block.
In general, these are not the same number. Try running stat on a regular file and you'll see this difference.
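For instance (GNU coreutils stat assumed; the exact block count depends on your filesystem):

```shell
printf 'hello\n' > small.txt
stat small.txt
# GNU stat reports both numbers, e.g. "Size: 6" (the filesize stored in
# the inode) and "Blocks: 8" (allocated 512-byte units, i.e. 4096 bytes)
stat -c 'filesize=%s bytes, allocated=%b blocks of %B bytes' small.txt
```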
When a filesystem creates a non-empty file, it usually eagerly allocates data blocks in groups. This is because files have a tendency to grow and shrink arbitrarily fast. If the filesystem only allocated as many data blocks as needed to represent the file, growing / shrinking would be slower, and fragmentation would be a serious concern. So in practice, filesystems don't have to keep reallocating space for small changes. This means that there may be a lot of space on disk that is "claimed" by files but completely unused.
What does the filesystem do with all this unused space? Nothing. Until it feels like it needs to. If your filesystem optimizer tool - maybe an online optimizer running in the background, maybe part of your fsck, maybe built-in to your filesystem itself - feels like it, it may reassign the data blocks of your files - moving used blocks, freeing unused blocks, etc.
So now we come to the difference between regular files and directories: because directories form the "backbone" of your filesystem, you expect them to be accessed and modified frequently, so they should be optimized - and you don't want them fragmented at all. When a directory is created, it maxes out all of its data blocks in size, even when it holds only a few directory entries. This is okay for directories because, unlike regular files, they are typically limited in size and growth rate.
The 4096 reported as the size of a directory is the "filesize" number stored in the directory's inode, not the number of entries in it. It isn't a fixed number - it's the maximum number of bytes that fit into the blocks allocated to the directory. Typically that's 8 blocks of 512 bytes each, the same allocation any non-empty file gets - and, incidentally, for directories the filesize and the allocated size are the same. Because the blocks are allocated as a single group, the filesystem optimizer won't move them around.
As the directory grows, more data blocks are assigned to it, and it will also max out those blocks by adjusting the filesize accordingly.
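You can watch this happen (ext4-style behaviour assumed; the exact numbers depend on your filesystem):

```shell
cd "$(mktemp -d)"
mkdir bigdir
stat -c %s bigdir      # typically 4096: one block group, already "maxed out"
# add enough entries that the directory needs more data blocks
for i in $(seq 1 1000); do touch "bigdir/some_reasonably_long_filename_$i"; done
stat -c %s bigdir      # now larger, and still a multiple of the block size
```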
And so ls and stat will show the filesize field of the directory's inode, which is set to the size of the data blocks assigned to it.