It seems like find would have to check whether a given path corresponds to a file or a directory anyway, in order to recursively walk the contents of directories. Here's some motivation, along with what I've done locally to convince myself that find . -type f really is slower than a plain find . (I haven't dug into the GNU find source code yet.)
So I'm backing up some of the files in my $HOME/Workspace directory, excluding files that are either dependencies of my projects or version control files. I ran the following command, which executed quickly:
% find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-and-dirs.txt
Piping find to grep may be bad form, but it seemed like the most direct way to apply a negated regex filter.
The following command includes only files in the output of find, and took noticeably longer:
% find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > ws-files-only.txt
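As an aside, the same exclusions can be expressed inside find itself with -prune, which skips the excluded directories entirely instead of filtering them out of the output afterwards. A rough sketch (the directory names mirror the grep pattern above; a path-anchored exclusion like Workspace/sources/ would use -path rather than -name, and the temp tree here is just an illustrative stand-in):

```shell
# Sketch: let find itself skip excluded directories with -prune,
# instead of piping through grep. Temp tree stands in for Workspace/.
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/node_modules/pkg" "$dir/.git"
touch "$dir/src/main.c" "$dir/node_modules/pkg/index.js" "$dir/.git/config"
find "$dir" \( -name vendor -o -name node_modules \
            -o -name venv -o -name .git \) -prune \
        -o -type f -print
# lists src/main.c; nothing under node_modules/ or .git/
```

Because -prune stops find from descending into the excluded directories at all, it can also be faster than filtering afterwards on large trees.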
I wrote some code to test the performance of these two commands (with dash and tcsh, just to rule out any effect the shell might have, even though there shouldn't be any). The tcsh results are omitted because they're essentially the same.
The results I got show a performance penalty of roughly 20–25% for -type f.
Here's the output of the program, showing the time taken to execute 1000 iterations of each command:
% perl tester.pl
/bin/sh -c find Workspace/ >/dev/null
82.986582
/bin/sh -c find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
90.313318
/bin/sh -c find Workspace/ -type f >/dev/null
102.882118
/bin/sh -c find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
109.872865
Tested with
% find --version
find (GNU findutils) 4.4.2
Copyright (C) 2007 Free Software Foundation, Inc.
On Ubuntu 15.10
Here's the Perl script I used for benchmarking:
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw[gettimeofday tv_interval];

my $max_iterations = 1000;

my $find_everything_no_grep = <<'EOF';
find Workspace/ >/dev/null
EOF

my $find_everything = <<'EOF';
find Workspace/ | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my $find_just_file_no_grep = <<'EOF';
find Workspace/ -type f >/dev/null
EOF

my $find_just_file = <<'EOF';
find Workspace/ -type f | grep -v '/vendor\|/node_modules/\|Workspace/sources/\|/venv/\|/.git/' > /dev/null
EOF

my @finds = (
    $find_everything_no_grep, $find_everything,
    $find_just_file_no_grep,  $find_just_file,
);

sub time_command {
    my @args  = @_;
    my $start = [gettimeofday()];
    for my $x (1 .. $max_iterations) {
        system(@args);
    }
    return tv_interval($start);
}

for my $shell (["/bin/sh", '-c']) {
    for my $command (@finds) {
        print "@$shell $command";
        printf "%s\n\n", time_command(@$shell, $command);
    }
}
Best Answer
GNU find has an optimization which can be applied to find . but not to find . -type f: if it knows that none of the remaining entries in a directory are directories, then it doesn't bother to determine the file type (with the stat system call) unless one of the search criteria requires it. Calling stat can take measurable time, since the information is typically in the inode, in a separate location on the disk, rather than in the containing directory.

How does it know? Because the link count on a directory indicates how many subdirectories it has. On typical Unix filesystems, a directory's link count is 2 plus its number of subdirectories: one for the directory's entry in its parent, one for its own . entry, and one for the .. entry in each subdirectory.

The -noleaf option tells find not to apply this optimization. This is useful if find is invoked on a filesystem where directory link counts don't follow the Unix convention.
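The link-count rule is easy to check directly. A small sketch, assuming GNU coreutils stat and a filesystem that follows the convention described above (the temp directory is arbitrary):

```shell
# A freshly created directory has link count 2 (its entry in the parent
# plus its own '.'); each subdirectory's '..' entry adds one more.
dir=$(mktemp -d)
stat -c '%h' "$dir"      # 2
mkdir "$dir/a" "$dir/b" "$dir/c"
stat -c '%h' "$dir"      # 5: 2 + three subdirectories
```

Once find has seen links-minus-2 subdirectories while reading a directory, it knows every remaining entry is a non-directory and can skip the stat calls that -type f would otherwise force.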