Print out list of files less than specified file size

Tags: command line, files, ls, scripting, search

I'm trying to create a script that can be executed to search a list of files, compare the file size field to the specified file size, and then display the files that are less than the specified file size.

I understand that I must use 'ls -l' in order to get a detailed list of files. However, how can I go about searching the list and pulling the files?

Best Answer

Your approach is clumsy (and, it's fair to say, wrong): parsing the output of ls is fragile. There are dedicated tools for this sort of task, and find is one of them.

For example, to find all files in the current directory that are less than 1 MiB (1048576 Bytes), recursively:

find . -type f -size -1048576c

Or, in a shell that provides size-based glob qualifiers, such as zsh, recursively:

print -rl -- **/*(.L-1048576)

Unlike the find command above, this does not include hidden files. Add the D glob qualifier to include them.
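Since the question asks for a script, the find command above can be wrapped in a small shell function. This is only a minimal sketch and not part of the original answer: the name list_smaller and its argument order (size limit in bytes, then an optional directory) are assumptions made for illustration.

```shell
# list_smaller LIMIT_BYTES [DIR]
# Hypothetical helper: print regular files under DIR (default: .) whose
# size is strictly less than LIMIT_BYTES, searching recursively.
list_smaller() {
    case $1 in
        ''|*[!0-9]*)
            # Reject a missing or non-numeric size limit.
            echo "usage: list_smaller LIMIT_BYTES [DIR]" >&2
            return 2
            ;;
    esac
    # The c suffix makes find count in bytes; the leading - means "less than".
    find "${2:-.}" -type f -size "-${1}c"
}
```

For example, list_smaller 1048576 /var/log would list the files under /var/log that are smaller than 1 MiB.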

Related Solutions

Shell – list all files newer than given timestamp and sort them

find supports a lot of date input formats. The simplest format to obtain is YYYYMMDD HH:MM:SS. You already have the digits in the right order, all you have to do is extract the first group (${timestamp%??????}: take all but the last 6 characters; ${timestamp#????????}: take all but the first 8 characters), and keep going, appending punctuation then the next group as you go along.

timestamp=20130207003851
timestring=${timestamp%??????}; timestamp=${timestamp#????????}
timestring="$timestring ${timestamp%????}"; timestamp=${timestamp#??}
timestring="$timestring:${timestamp%??}:${timestamp#??}"

In bash (and ksh and zsh), but not in ash, you can use the more readable ${STRING_VARIABLE:OFFSET:LENGTH} construct.

timestring="${timestamp:0:8} ${timestamp:8:2}:${timestamp:10:2}:${timestamp:12:2}"

To sort files by date, print out the file names preceded by the dates and sort that, then strip the date prefix. Use -printf to control the output format. %TX prints a part of the modification time determined by X; if X is @, you get the number of seconds since the Unix epoch. Below I print three tab-separated columns: the time in sortable format, the file name, and the time in human-readable format; cut -f 2- removes the first column and the call to expand replaces the tab by enough spaces to accommodate all expected file names (adjust 40 as desired). This code assumes you have no newlines or tabs in file names.

find -maxdepth 1 -type f \
     -newermt "$timestring" -printf '%T@\t%f\t%Tb %Td %TH:%TM\n' |
sort -k1n |
cut -f 2- |
expand -t 40

Filesystems – Why Size Reporting for Directories Differs from Other Files

I think the reason you're confused is that you don't know what a directory is. To clear this up, let's take a step back and examine how Unix filesystems work.

The Unix filesystem has several separate notions for addressing data on disk:

- Data blocks are groups of blocks on a disk that hold the contents of a file.
- Inodes are special blocks on a filesystem, each with a numerical address unique within that filesystem, that contain metadata about a file, such as:
  - permissions
  - access / modification times
  - size
  - pointers to the data blocks (could be a list of blocks, extents, etc.)
- Filenames are hierarchical locations under a filesystem root that are mapped to inodes.

In other words, a "file" is actually composed of three different things:

- a path in the filesystem
- an inode with metadata
- data blocks pointed to by the inode

Most of the time, users imagine a file to be synonymous with "the entity associated with the filename"; it's only when you're dealing with low-level entities or the file/socket API that you think of inodes or data blocks. Directories are one of those low-level entities.

You might think that a directory is a file that contains a bunch of other files. That's only half-correct. A directory is a file that maps filenames to inode numbers. It doesn't "contain" files; it contains names, each pointing to an inode. Think of it like a text file that contains entries like this:

. - inode 1234
.. - inode 200
Documents - inode 2008
README.txt - inode 2009

The entries above are called directory entries. They are basically mappings from filenames to inode numbers. A directory is a special file that contains directory entries.
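You can look at real directory entries with ls. The sketch below is an illustration only: it builds a scratch directory (the names README.txt and Documents simply mirror the example above) and prints each entry in the same "name - inode" form. The inode numbers you see will differ from system to system.

```shell
# Create a scratch directory with one file and one subdirectory.
dir=$(mktemp -d)
touch "$dir/README.txt"
mkdir "$dir/Documents"

# -a includes the . and .. entries; -i prefixes each name with its inode number.
# awk swaps the two columns into "name - inode N" form.
entries=$(ls -ai "$dir" | awk '{ print $2 " - inode " $1 }')
printf '%s\n' "$entries"
```

Note that the awk step assumes file names without whitespace, which holds for this scratch directory.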

That's a simplification of course, but it explains the basic idea and other directory weirdness.

Why don't directories know their own size?
- Because they only contain pointers to other stuff, you have to iterate over their contents to find the size
Why aren't directories ever empty?
- Because they contain at least the . and .. entries. Thus, a directory will be at least as large as the smallest file size that can hold those entries. On many filesystems that minimum is one block, commonly 4096 bytes.
Why is it that you need write permission on the parent directory when renaming a file?
- Because you're not just changing the file, you're changing the directory entry pointing to the file.
Why does ls show a weird number of "links" to a directory?
- A directory is referenced (linked to) by its entry in its parent, by its own . entry, and by the .. entry of each of its subdirectories.
What does a hard link do and how does it differ from a symlink?
- A hard link adds a directory entry pointing to the same inode number. Because it points to an inode number, it can only point to files in the same filesystem (inode numbers are local to a filesystem).
- A symlink adds a new inode whose content is a filename (a path). Because it refers to a filename, it can point to arbitrary files in the tree.
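The hard link versus symlink distinction is easy to check by comparing inode numbers. This is a small sketch under assumptions: the file names and the inode_of helper are made up for the demonstration, and it relies only on standard ls and ln behaviour.

```shell
dir=$(mktemp -d)
echo hello > "$dir/original"

ln "$dir/original" "$dir/hardlink"   # new directory entry, same inode
ln -s original "$dir/symlink"        # new inode that stores the path "original"

# Hypothetical helper: print the inode number of the name itself
# (with -d, a symlink given on the command line is not followed).
inode_of() { ls -di "$1" | awk '{ print $1 }'; }

orig_inode=$(inode_of "$dir/original")
hard_inode=$(inode_of "$dir/hardlink")
sym_inode=$(inode_of "$dir/symlink")
echo "original=$orig_inode hardlink=$hard_inode symlink=$sym_inode"

# The link count (second field of ls -l) counts directory entries
# that point to the inode.
link_count=$(ls -ld "$dir/original" | awk '{ print $2 }')
echo "link count of original: $link_count"
```

The original and the hard link should report the same inode number, while the symlink has an inode of its own.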

But wait! Weird things are happening!

ls -ld somedirectory typically shows the filesize to be 4096, whereas ls -l somefile shows the actual size of a file. Why?

Point of confusion 1: when we say "size" we can be referring to two things:

filesize, which is a number stored in the inode; and
allocated size, which is the number of blocks associated with the inode times the size of each block.

In general, these are not the same number. Try running stat on a regular file and you'll see this difference.
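As a rough way to see both numbers side by side, here is a sketch that avoids stat's format options (they differ between GNU and BSD) and uses wc -c for the filesize and du -k for the allocated space instead. The variable names are illustrative, and the allocated figure depends on the filesystem, so no particular output is promised.

```shell
file=$(mktemp)
printf 'x' > "$file"    # a file with exactly one byte of content

# Byte count of the contents, which equals the filesize stored in the inode.
size_bytes=$(wc -c < "$file" | tr -d ' ')

# Space actually allocated on disk, reported by du in KiB.
alloc_kib=$(du -k "$file" | awk '{ print $1 }')

echo "filesize: $size_bytes byte(s), allocated: $alloc_kib KiB"
rm -f "$file"
```

On many filesystems the one-byte file occupies a whole block, so the allocated figure comes out larger than the filesize.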

When a filesystem creates a non-empty file, it usually eagerly allocates data blocks in groups. This is because files have a tendency to grow and shrink arbitrarily fast. If the filesystem only allocated as many data blocks as needed to represent the file, growing / shrinking would be slower, and fragmentation would be a serious concern. So in practice, filesystems don't have to keep reallocating space for small changes. This means that there may be a lot of space on disk that is "claimed" by files but completely unused.

What does the filesystem do with all this unused space? Nothing. Until it feels like it needs to. If your filesystem optimizer tool - maybe an online optimizer running in the background, maybe part of your fsck, maybe built-in to your filesystem itself - feels like it, it may reassign the data blocks of your files - moving used blocks, freeing unused blocks, etc.

So now we come to the difference between regular files and directories: because directories form the "backbone" of your filesystem, you expect that they may need to be accessed or modified frequently and should thus be optimized. And so you don't want them fragmented at all. When directories are created, they always max out all their data blocks in size, even when they only have so many directory entries. This is okay for directories, because, unlike files, directories are typically limited in size and growth rate.

The 4096 reported size of directories is the "filesize" number stored in the directory inode, not the number of entries in the directory. It isn't a fixed number - it's the maximum bytes that will fit into the allocated number of blocks for the directory. Typically, this is 512 bytes/block times 8 blocks allocated for a file with any contents - incidentally, for directories, the filesize and the allocated size are the same. Because it's allocated as a single group, the filesystem optimizer won't move its blocks around.

As the directory grows, more data blocks are assigned to it, and it will also max out those blocks by adjusting the filesize accordingly.

And so ls and stat will show the filesize field of the directory's inode, which is set to the size of the data blocks assigned to it.
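To watch the reported directory size change, you can compare ls -ld output before and after adding many entries. This is a sketch under assumptions: the entry count of 500 and the long file name are arbitrary choices, and the actual numbers (and whether they grow in whole blocks) depend on the filesystem.

```shell
dir=$(mktemp -d)

# The fifth field of ls -ld is the filesize stored in the directory's inode.
size_empty=$(ls -ld "$dir" | awk '{ print $5 }')

# Add many entries with long names so the directory needs more space.
i=0
while [ "$i" -lt 500 ]; do
    touch "$dir/a_rather_long_file_name_to_fill_directory_blocks_$i"
    i=$((i + 1))
done
size_full=$(ls -ld "$dir" | awk '{ print $5 }')

echo "empty: $size_empty bytes, with 500 entries: $size_full bytes"
rm -rf "$dir"
```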
