Bash – Most Resource Efficient Way to Count Files in a Directory

Tags: bash, directory, ls, shell

CentOS 5.9

I came across an issue the other day where a directory contained a very large number of files. To count them, I ran ls -l /foo/foo2/ | wc -l

Turns out that there were over 1 million files in a single directory (long story — the root cause is getting fixed).

My question is: is there a faster way to do the count? What would be the most efficient way to get the count?

Best Answer

Short answer:

\ls -afq | wc -l

(This includes . and .., so subtract 2.)
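As a minimal sketch, the short answer can be wrapped in a function that does the subtraction for you. The name count_entries is made up for illustration and is not part of the original answer; it assumes an ls that supports -f and -q (GNU coreutils does).

```shell
# count_entries DIR: hypothetical helper, not from the original answer.
# Prints the number of entries in DIR (default: current directory),
# hidden files included, without sorting or calling stat on each entry.
count_entries() {
    dir=${1:-.}
    # \ls bypasses an alias named ls; -a includes dot files,
    # -f disables sorting, -q prints ? for non-printable characters.
    n=$(\ls -afq -- "$dir" | wc -l)
    # -a/-f always list . and .., so remove them from the total.
    echo $((n - 2))
}
```

Usage would look like count_entries /foo/foo2.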

When you list the files in a directory, three common things might happen:

1. Enumerating the file names in the directory. This is inescapable: there is no way to count the files in a directory without enumerating them.
2. Sorting the file names. Shell wildcards and the ls command do that.
3. Calling stat to retrieve metadata about each directory entry, such as whether it is a directory.

#3 is by far the most expensive, because it requires loading an inode for each file. In comparison, all the file names needed for #1 are stored compactly in a few blocks. #2 wastes some CPU time, but it is often not a deal breaker.

If there are no newlines in file names, a simple ls -A | wc -l tells you how many files there are in the directory. Beware that if you have an alias for ls, this may trigger a call to stat (e.g. ls --color or ls -F need to know the file type, which requires a call to stat), so from the command line, call command ls -A | wc -l or \ls -A | wc -l to avoid an alias.
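Inside a script or function, the alias-free form might look like the sketch below. The function name count_ls_A is hypothetical. Note that command ls also skips a shell function named ls, which the backslash form does not.

```shell
# count_ls_A DIR: hypothetical helper for illustration only.
# Counts entries in DIR, dot files included, assuming no file name
# contains a newline.
count_ls_A() {
    dir=${1:-.}
    # "command" ignores aliases and shell functions named ls, so options
    # such as --color or -F cannot sneak in and trigger stat calls.
    # -A lists dot files but omits . and .., so nothing is subtracted.
    command ls -A -- "$dir" | wc -l
}
```

This variant still sorts the names (#2), so it trades a little CPU time for not having to subtract anything.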

If there are newlines in the file name, whether newlines are listed or not depends on the Unix variant. GNU coreutils and BusyBox default to displaying ? for a newline, so they're safe.

Call ls -f to list the entries without sorting them (#2). This automatically turns on -a (at least on modern systems). The -f option is in POSIX but with optional status; most implementations support it, but not BusyBox. The option -q replaces non-printable characters, including newlines, with ?; it's POSIX but isn't supported by BusyBox, so omit it if you need BusyBox support, at the expense of overcounting files whose name contains a newline character.

If the directory has no subdirectories, then most versions of find will not call stat on its entries (leaf directory optimization: a directory that has a link count of 2 cannot have subdirectories, so find doesn't need to look up the metadata of the entries unless a condition such as -type requires it). So find . | wc -l is a portable, fast way to count files in a directory provided that the directory has no subdirectories and that no file name contains a newline.

If the directory has no subdirectories but file names may contain newlines, try one of these (the second one should be faster if it's supported, but may not be noticeably so).

find -print0 | tr -dc \\0 | wc -c
find -printf a | wc -c
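A sketch of the first variant as a function, assuming a find that supports -print0 (GNU and BSD find do; it is not in older POSIX). The name count_find0 is made up for illustration. Keep in mind that find also prints the starting directory itself, so one is subtracted here.

```shell
# count_find0 DIR: hypothetical helper, not from the original answer.
# Counts entries under DIR safely even when names contain newlines.
# Intended for a directory without subdirectories, as discussed above.
count_find0() {
    dir=${1:-.}
    # -print0 terminates each path with a NUL byte; tr deletes every
    # byte that is not NUL, and wc -c counts the NULs that remain.
    n=$(find "$dir" -print0 | tr -dc '\0' | wc -c)
    # find lists DIR itself as the first result, so subtract it.
    echo $((n - 1))
}
```

With a file whose name contains a newline, find "$dir" | wc -l counts that file twice, whereas this version counts it once.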

On the other hand, don't use find if the directory has subdirectories: even find . -maxdepth 1 calls stat on every entry (at least with GNU find and BusyBox find). You avoid sorting (#2) but you pay the price of an inode lookup (#3) which kills performance.

In the shell without external tools, you can count the files in the current directory with set -- *; echo $#. This misses dot files (files whose name begins with .) and reports 1 instead of 0 in an empty directory. This is the fastest way to count files in small directories because it doesn't require starting an external program, but (except in zsh) wastes time for larger directories due to the sorting step (#2).
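To see both caveats in practice, here is a small sketch. The function name count_glob is hypothetical; the body runs in a subshell so that neither the cd nor the changed positional parameters affect the caller.

```shell
# count_glob DIR: hypothetical helper for illustration only.
# The parentheses make the function body a subshell.
count_glob() (
    cd -- "$1" || exit
    # The glob expands to the sorted list of non-hidden names.
    # If nothing matches, the pattern stays as the literal word "*".
    set -- *
    echo $#
)
```

In a directory holding a, b and .hidden this prints 2 (the dot file is missed), and in an empty directory it prints 1 because the unexpanded * is counted as one word.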

In bash, this is a reliable way to count the files in the current directory:
```
shopt -s dotglob nullglob
a=(*)
echo ${#a[@]}
```
In ksh93, this is a reliable way to count the files in the current directory:
```
FIGNORE='@(.|..)'
a=(~(N)*)
echo ${#a[@]}
```
In zsh, this is a reliable way to count the files in the current directory:
```
a=(*(DNoN))
echo $#a
```
If you have the mark_dirs option set, make sure to turn it off: a=(*(DNoN^M)).

In any POSIX shell, this is a reliable way to count the files in the current directory:

total=0
set -- *
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
set -- .[!.]*
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
set -- ..?*
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
echo "$total"
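The three near-identical steps above could also be folded into a loop and wrapped in a function that takes the directory as an argument. This is only a sketch of the same technique; the name count_posix is made up, and the subshell body keeps the cd and positional parameters away from the caller.

```shell
# count_posix DIR: hypothetical wrapper around the POSIX method above.
count_posix() (
    cd -- "$1" || exit
    total=0
    # Three patterns together cover every name except . and ..
    for pattern in '*' '.[!.]*' '..?*'; do
        # Unquoted on purpose so the shell performs pathname expansion.
        set -- $pattern
        # With exactly one word, it may be the unmatched pattern itself;
        # count it only if a file (or dangling symlink) of that name exists.
        if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then
            total=$((total + $#))
        fi
    done
    echo "$total"
)
```

Like the original, this still sorts the names during glob expansion, so it is about correctness rather than speed.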

All of these methods sort the file names, except for the zsh one.
