Why is the grep/-r/–include combination slower than the find/-exec/grep combination

command linefindgrepperformance

From my understanding, the two following commands roughly accomplish the same thing:

Command 1:

find -name "filename.xml" -exec grep someString {} \;

Command 2:

grep -r --include=filename.xml someString .

Still, when timing them after warming up in the same context, the first one was about 3 times faster than the second one (something like 4 seconds vs 12 seconds).

The number of file matching the filename pattern in the folder tree that I tested was very small, and each of these files were very small.
This makes me think that most of the time was spent in finding the files matching the filename pattern, and not in the grepping of those matching files.

So why is there such a big difference in performance of those two command lines?

Best Answer

It is actually the opposite way around; the grep command tends to be more efficient in general.

I'll work on a Portage tree snapshot from Gentoo, which are publically available if you want to try.

 $ time find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null

real    0m1.184s
user    0m0.033s
sys     0m0.130s

 $ time grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null

real    0m0.017s
user    0m0.007s
sys     0m0.010s

Let's look which functions are called the most for each:

 $ (strace find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null) |& sed 's/[({].*//g' | sort | uniq -c | sort -r | head -n 10
   3574 fcntl
   1597 close
    794 newfstatat
    794 getdents
    689 wait4
    689 clone
    689 --- SIGCHLD 
    404 fstat
    397 openat
     20 mmap

 $ (strace grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null) |& sed 's/[({].*//g' | sort | uniq -c | sort -r | head -n 10
   2779 fcntl
   1493 close
   1382 read
   1096 fstat
   1087 openat
    794 getdents
    792 newfstatat
    691 ioctl
    689 lseek
     25 write

And also look at the calls that were long:

 $ (strace -T find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null) |& sed 's/\(.*\)<\(.*\)>/\2 \1/g' | sort -nk1r | head -n10
exit_group(0)                           = ?
0.001884 wait4(29725, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29725 
0.001879 wait4(29475, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29475 
0.001813 wait4(29430, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29430 
0.001812 wait4(30089, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30089 
0.001807 wait4(29722, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29722 
0.001795 wait4(29645, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29645 
0.001794 wait4(29848, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29848 
0.001759 wait4(30032, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30032 
0.001754 wait4(30093, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30093

 $ (strace -T grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null) |& 
exit_group(0)                           = ?
0.002336 fcntl(3, F_SETFD, FD_CLOEXEC)           = 0 
0.000460 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\30`C6\0\0\0"..., 832) = 832 
0.000313 close(3)                                = 0 
0.000295 execve("/bin/grep", ["grep", "-r", "--include", "*.ebuild", "DEPEND", "/usr/portage/sys-apps/"], [/* 75 vars */]) = 0 
0.000276 fcntl(3, F_SETFD, FD_CLOEXEC)           = 0 
0.000265 getdents(3, /* 244 entries */, 32768)   = 7856 
0.000233 fstat(3, {st_mode=S_IFREG|0644, st_size=826, ...}) = 0 
0.000162 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 
0.000137 lseek(3, 1402, 0x4 /* SEEK_??? */)      = -1 ENXIO (No such device or address)

Quite interesting, you see in this duration output that find is waiting a lot whereas grep does some stuff that is required to start and stop the process. The wait calls take more than 0.001s whereas the find calls decreases to a steady ~0.0002s.

If you look at the wait4 calls in the count output, you will notice that there is an equal amount of clone calls and SIGCHLD signals occuring; this is because find calls the grep process for each file it comes across, this is where its efficiency suffers as cloning and waiting is costly.

There are occasions where it doesn't suffer; you could get a small enough set of files so there isn't much overhead of starting multiple grep processes, you could also have a very slow disk that neglects the overhead of starting a new process, and there are probably other reasons as well. Though when comparing the speed, we often look at how well one or another approach scales, and not look at special corner cases.

In you case you have mentioned that "This is why I feel that it's the way "grep" visits the directory tree that is inefficient compared to "find".", this may indeed be the case; as you can see 1382 read calls have been made whereas find does not do that, this makes grep more I/O intensive for you.

TL;DR: To see why your timings are inefficient, try to do this analysis again and pinpoint the issue in your case such that you know why your specific data and task are not efficient in grep; you'll discover how different grep can behave in your corner case...

So, as others suggested you will want to make sure that it doesn't call grep for each file, which can be done by replacing \; by + near the end.

 $ time find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} + > /dev/null

real    0m0.027s
user    0m0.010s
sys     0m0.013s

As you can see, 0.027s comes quite close to 0.017s; the difference is mostly attributable to the fact that it still has to call both find and grep as opposed to just grep alone. Or as shown in the comments, on some systems the + allows you to improve over grep.

Best Answer

Related Solutions

Using awk with find -exec

Related Question