Why does `find . -type f` take longer than `find .`?

Tags: find, gnu, performance

It seems like find would have to check whether a given path corresponds to a file or directory anyway in order to recursively walk the contents of directories.

Here's some motivation and what I've done locally to convince myself that `find . -type f` really is slower than `find .`. I haven't dug into the GNU find source code yet.

So I'm backing up some of the files in my $HOME/Workspace directory, and excluding files that are either dependencies of my projects or version control files.

So I ran the following command, which executed quickly:

% find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-and-dirs.txt

find piped to grep may be bad form, but it seemed like the most direct way to use a negated regex filter.
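A find-native alternative would be to -prune the unwanted directories instead. This is only a sketch (it skips whole subtrees rather than filtering output lines, so the result isn't guaranteed to be byte-for-byte identical to the grep version):

% find Workspace/ \( -path Workspace/sources -o -name vendor -o -name node_modules -o -name venv -o -name .git \) -prune -o -print > ws-files-and-dirs.txt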

The following command, which limits the output of find to regular files, took noticeably longer:

% find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-only.txt

I wrote some code to test the performance of these two commands (with dash and tcsh, just to rule out any effects the shell might have, even though there shouldn't be any). The tcsh results have been omitted because they're essentially the same.

The results I got showed a performance penalty of roughly 25% for -type f (about 103 seconds versus 83 seconds per 1000 iterations without the grep filter).

Here's the output of the program, showing the time (in seconds) taken to execute 1000 iterations of each command:

% perl tester.pl
/bin/sh -c find Workspace/ >/dev/null
82.986582

/bin/sh -c find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
90.313318

/bin/sh -c find Workspace/ -type f >/dev/null
102.882118

/bin/sh -c find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
109.872865

Tested with:

% find --version
find (GNU findutils) 4.4.2
Copyright (C) 2007 Free Software Foundation, Inc.

On Ubuntu 15.10

Here's the Perl script I used for benchmarking:

#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw[gettimeofday tv_interval];

# Number of iterations to run each command.
my $max_iterations = 1000;

my $find_everything_no_grep = <<'EOF';
find Workspace/ >/dev/null
EOF

my $find_everything = <<'EOF';
find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my $find_just_file_no_grep = <<'EOF';
find Workspace/ -type f >/dev/null
EOF

my $find_just_file = <<'EOF';
find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my @finds = ($find_everything_no_grep, $find_everything,
    $find_just_file_no_grep, $find_just_file);

# Run the given command $max_iterations times and return the total
# elapsed wall-clock time in seconds.
sub time_command {
    my @args = @_;
    my $start = [gettimeofday()];
    for my $x (1 .. $max_iterations) {
        system(@args);
    }
    return tv_interval($start);
}

for my $shell (["/bin/sh", '-c']) {
    for my $command (@finds) {
        print "@$shell $command";
        printf "%s\n\n", time_command(@$shell, $command);
    }
}

Best Answer

GNU find has an optimization which can be applied to find . but not to find . -type f: if it knows that none of the remaining entries in a directory are directories, then it doesn't bother to determine the file type (with the stat system call) unless one of the search criteria requires it. Calling stat can take measurable time since the information is typically in the inode, in a separate location on the disk, rather than in the containing directory.
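One way to convince yourself of this (not part of the original answer; the exact syscall name varies by platform, so newer systems may show fstatat or newfstatat instead of lstat) is to count stat-family calls with strace:

% strace -c -e trace=lstat find Workspace/ > /dev/null
% strace -c -e trace=lstat find Workspace/ -type f > /dev/null

If the optimization applies on your filesystem, the second command should report many more lstat calls than the first.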

How does it know? Because the link count on a directory indicates how many subdirectories it has. On typical Unix filesystems, a directory's link count is 2 plus the number of subdirectories: one link for the directory's entry in its parent, one for its own . entry, and one for the .. entry in each subdirectory.
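To make the mechanism concrete, here's a minimal Perl sketch of a directory walker that exploits this convention. It's an illustration of the idea only, not GNU find's actual implementation, and it assumes link counts follow the 2-plus-subdirectories rule:

#!/usr/bin/env perl
use strict;
use warnings;

# Walk a tree, skipping lstat calls once we know a directory has no
# more subdirectories among its remaining entries.
sub walk {
    my ($dir) = @_;
    # Unix convention: nlink = 2 + number of subdirectories.
    my $subdirs_left = (lstat $dir)[3] - 2;

    opendir(my $dh, $dir) or return;
    for my $entry (readdir $dh) {
        next if $entry eq '.' || $entry eq '..';
        my $path = "$dir/$entry";
        print "$path\n";
        if ($subdirs_left > 0) {
            lstat $path;          # one system call per entry ...
            if (-d _) {           # ... reusing the cached stat buffer
                $subdirs_left--;
                walk($path);
            }
        }
        # Once $subdirs_left reaches 0, the remaining entries can't be
        # directories, so no further lstat calls are needed here.
    }
    closedir $dh;
}

walk(shift(@ARGV) // '.');

A -type f test defeats exactly this shortcut: deciding whether each entry is a regular file requires the very lstat call the optimization avoids.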

The -noleaf option tells find not to apply this optimization. This is useful if find is invoked on some filesystem where directory link counts don't follow the Unix convention.
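For example (a hypothetical invocation; GNU find's documentation mentions CD-ROM and MS-DOS filesystems as cases where the convention doesn't hold), searching a mounted ISO 9660 image might look like:

% find /mnt/cdrom -noleaf -type f -name '*.txt'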