Why does `find . -type f` take longer than `find .`?

Tags: find, gnu, performance

It seems like find would have to check whether a given path corresponds to a file or directory anyway in order to recursively walk the contents of directories.

Here's some motivation and what I've done locally to convince myself that `find . -type f` really is slower than `find .`. I haven't dug into the GNU find source code yet.

So I'm backing up some of the files in my $HOME/Workspace directory, and excluding files that are either dependencies of my projects or version control files.

So I ran the following command, which executed quickly:

% find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-and-dirs.txt

find piped to grep may be bad form, but it seemed like the most direct way to use a negated regex filter.
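A find-native alternative would be to -prune the unwanted directories instead. This is only a sketch (it skips whole subtrees rather than filtering output lines, so the result isn't guaranteed to be byte-for-byte identical to the grep version):

% find Workspace/ \( -path Workspace/sources -o -name vendor -o -name node_modules -o -name venv -o -name .git \) -prune -o -print > ws-files-and-dirs.txt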

The following command, which limits the output of find to regular files, took noticeably longer:

% find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-only.txt

I wrote some code to test the performance of these two commands (with dash and tcsh, just to rule out any effects the shell might have, even though there shouldn't be any). The tcsh results have been omitted because they're essentially the same.

The results I got showed a performance penalty of roughly 25% for -type f (about 103 seconds versus 83 seconds per 1000 iterations without the grep filter).

Here's the output of the program, showing the time (in seconds) taken to execute 1000 iterations of each command:

% perl tester.pl
/bin/sh -c find Workspace/ >/dev/null
82.986582

/bin/sh -c find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
90.313318

/bin/sh -c find Workspace/ -type f >/dev/null
102.882118

/bin/sh -c find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
109.872865

Tested with:

% find --version
find (GNU findutils) 4.4.2
Copyright (C) 2007 Free Software Foundation, Inc.

On Ubuntu 15.10

Here's the Perl script I used for benchmarking:

#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw[gettimeofday tv_interval];

# Number of iterations to run each command.
my $max_iterations = 1000;

my $find_everything_no_grep = <<'EOF';
find Workspace/ >/dev/null
EOF

my $find_everything = <<'EOF';
find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my $find_just_file_no_grep = <<'EOF';
find Workspace/ -type f >/dev/null
EOF

my $find_just_file = <<'EOF';
find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my @finds = ($find_everything_no_grep, $find_everything,
    $find_just_file_no_grep, $find_just_file);

# Run the given command $max_iterations times and return the total
# elapsed wall-clock time in seconds.
sub time_command {
    my @args = @_;
    my $start = [gettimeofday()];
    for my $x (1 .. $max_iterations) {
        system(@args);
    }
    return tv_interval($start);
}

for my $shell (["/bin/sh", '-c']) {
    for my $command (@finds) {
        print "@$shell $command";
        printf "%s\n\n", time_command(@$shell, $command);
    }
}

Best Answer

GNU find has an optimization which can be applied to find . but not to find . -type f: if it knows that none of the remaining entries in a directory are directories, then it doesn't bother to determine the file type (with the stat system call) unless one of the search criteria requires it. Calling stat can take measurable time since the information is typically in the inode, in a separate location on the disk, rather than in the containing directory.
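One way to convince yourself of this (not part of the original answer; the exact syscall name varies by platform, so newer systems may show fstatat or newfstatat instead of lstat) is to count stat-family calls with strace:

% strace -c -e trace=lstat find Workspace/ > /dev/null
% strace -c -e trace=lstat find Workspace/ -type f > /dev/null

If the optimization applies on your filesystem, the second command should report many more lstat calls than the first.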

How does it know? Because the link count on a directory indicates how many subdirectories it has. On typical Unix filesystems, a directory's link count is 2 plus the number of subdirectories: one link for the directory's entry in its parent, one for its own . entry, and one for the .. entry in each subdirectory.
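To make the mechanism concrete, here's a minimal Perl sketch of a directory walker that exploits this convention. It's an illustration of the idea only, not GNU find's actual implementation, and it assumes link counts follow the 2-plus-subdirectories rule:

#!/usr/bin/env perl
use strict;
use warnings;

# Walk a tree, skipping lstat calls once we know a directory has no
# more subdirectories among its remaining entries.
sub walk {
    my ($dir) = @_;
    # Unix convention: nlink = 2 + number of subdirectories.
    my $subdirs_left = (lstat $dir)[3] - 2;

    opendir(my $dh, $dir) or return;
    for my $entry (readdir $dh) {
        next if $entry eq '.' || $entry eq '..';
        my $path = "$dir/$entry";
        print "$path\n";
        if ($subdirs_left > 0) {
            lstat $path;          # one system call per entry ...
            if (-d _) {           # ... reusing the cached stat buffer
                $subdirs_left--;
                walk($path);
            }
        }
        # Once $subdirs_left reaches 0, the remaining entries can't be
        # directories, so no further lstat calls are needed here.
    }
    closedir $dh;
}

walk(shift(@ARGV) // '.');

A -type f test defeats exactly this shortcut: deciding whether each entry is a regular file requires the very lstat call the optimization avoids.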

The -noleaf option tells find not to apply this optimization. This is useful if find is invoked on some filesystem where directory link counts don't follow the Unix convention.
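For example (a hypothetical invocation; GNU find's documentation mentions CD-ROM and MS-DOS filesystems as cases where the convention doesn't hold), searching a mounted ISO 9660 image might look like:

% find /mnt/cdrom -noleaf -type f -name '*.txt'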