From my understanding, the following two commands accomplish roughly the same thing:
Command 1:
find -name "filename.xml" -exec grep someString {} \;
Command 2:
grep -r --include=filename.xml someString .
Still, when timing them after warming up in the same context, the first one was about 3 times faster than the second one (something like 4 seconds vs 12 seconds).
The number of files matching the filename pattern in the folder tree that I tested was very small, and each of these files was very small.
This makes me think that most of the time was spent in finding the files matching the filename pattern, and not in the grepping of those matching files.
So why is there such a big difference in performance of those two command lines?
Best Answer
It is actually the opposite way around; the grep command tends to be more efficient in general.
I'll work on a Portage tree snapshot from Gentoo, which is publicly available if you want to try.
Let's look at which functions are called the most for each:
And also look at the calls that took a long time:
Quite interesting: in this duration output you can see that find spends a lot of time waiting, whereas grep mostly does the work that is required to start and stop the process. The wait calls take more than 0.001s, whereas the find calls decrease to a steady ~0.0002s.
If you look at the wait4 calls in the count output, you will notice an equal number of clone calls and SIGCHLD signals. This is because find starts a grep process for each file it comes across, and this is where its efficiency suffers, as cloning and waiting are costly.
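To make the per-file process cost visible without a tracer, here is a minimal sketch that counts how often grep gets started. It builds a throwaway tree with three matching files; the wrapper script `countgrep` and the `COUNT_LOG` variable are names invented for this illustration, not part of find or grep.

```shell
#!/bin/sh
# Hypothetical demo: count how many grep processes each find variant starts.
tmp=$(mktemp -d)
mkdir -p "$tmp/tree/a" "$tmp/tree/b" "$tmp/tree/c"
for d in a b c; do
    echo "someString" > "$tmp/tree/$d/filename.xml"
done

# Wrapper (invented for this demo): logs one line per invocation, then runs grep.
cat > "$tmp/countgrep" <<'EOF'
#!/bin/sh
echo run >> "$COUNT_LOG"
exec grep "$@"
EOF

# Variant 1: "\;" runs the command once per matching file.
COUNT_LOG="$tmp/per_file.log"; export COUNT_LOG
find "$tmp/tree" -name "filename.xml" -exec sh "$tmp/countgrep" someString {} \; > /dev/null
per_file=$(grep -c run "$tmp/per_file.log")

# Variant 2: "+" appends as many file names as fit onto one command line.
COUNT_LOG="$tmp/batched.log"; export COUNT_LOG
find "$tmp/tree" -name "filename.xml" -exec sh "$tmp/countgrep" someString {} + > /dev/null
batched=$(grep -c run "$tmp/batched.log")

echo "per-file form started grep $per_file times, batched form $batched time(s)"
rm -rf "$tmp"
```

With only three files the difference is irrelevant, but the first count grows with the number of matching files while the second stays at one until the command line fills up.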
There are occasions where it doesn't suffer: you could have a small enough set of files that starting multiple grep processes adds little overhead, or a disk so slow that the cost of starting a new process becomes negligible by comparison, and there are probably other reasons as well. When comparing speed, though, we usually look at how well one approach or the other scales, not at special corner cases.
In your case, you mentioned that "This is why I feel that it's the way "grep" visits the directory tree that is inefficient compared to "find"." This may indeed be the case: as you can see, grep made 1382 read calls whereas find makes none, which makes grep more I/O intensive for you.
TL;DR: To see why grep is slower in your timings, try to do this analysis again and pinpoint the issue in your case, so that you know why grep is inefficient for your specific data and task; you'll discover how differently grep can behave in your corner case...
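One possible way to redo such an analysis is sketched below. It assumes Linux with strace installed and permission to trace; the flags shown are a guess at a workable setup, not necessarily the ones used for the figures in this answer, and the tiny temporary tree is only a stand-in for your real data.

```shell
#!/bin/sh
# Sketch only: assumes Linux with strace installed and tracing permitted.
out=$(mktemp -d)
mkdir -p "$out/tree/sub"
echo "someString" > "$out/tree/sub/filename.xml"   # tiny stand-in data set

status=skipped
if command -v strace >/dev/null 2>&1; then
    # -c: per-syscall count summary; -f: also follow child processes (the spawned greps).
    if strace -c -f -o "$out/find.count" \
           find "$out/tree" -name "filename.xml" -exec grep someString {} \; >/dev/null 2>&1 &&
       strace -c -f -o "$out/grep.count" \
           grep -r --include=filename.xml someString "$out/tree" >/dev/null 2>&1
    then
        status=traced
        # -T: append the time spent inside each syscall to every trace line.
        strace -T -f -o "$out/find.trace" \
            find "$out/tree" -name "filename.xml" -exec grep someString {} \; >/dev/null 2>&1
    else
        status=failed
    fi
fi
echo "strace run: $status; output (if any) is in $out"
```

Point the commands at your own directory tree instead of the stand-in, then compare the clone, wait4, and read rows of the two summaries.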
So, as others suggested, you will want to make sure that it doesn't call grep for each file, which can be done by replacing \; by + near the end.
As you can see, 0.027s comes quite close to 0.017s; the difference is mostly attributable to the fact that it still has to call both find and grep as opposed to just grep alone. Or, as shown in the comments, on some systems the + allows you to improve over grep.
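For completeness, here is a small sketch of the batched form next to the recursive grep from the question, run on a throwaway tree. The `-H` option is an addition made here, not part of the original command: grep omits the file name prefix when it receives a single file, so `-H` keeps the output comparable to `grep -r`. The sort is there only because the two commands may visit files in a different order.

```shell
#!/bin/sh
# Throwaway test tree; the file contents are made up for the demo.
tmp=$(mktemp -d)
mkdir -p "$tmp/x/y"
echo "someString here" > "$tmp/x/filename.xml"
echo "nothing to see"  > "$tmp/x/y/filename.xml"
echo "someString too"  > "$tmp/x/y/other.xml"    # wrong name, must be ignored

# Batched find: file names are appended to as few grep command lines as possible.
find_out=$(find "$tmp" -name "filename.xml" -exec grep -H someString {} + | sort)

# Recursive grep restricted to the same file name.
grep_out=$(grep -r --include=filename.xml someString "$tmp" | sort)

if [ "$find_out" = "$grep_out" ]; then
    echo "both commands report the same matches"
fi
rm -rf "$tmp"
```

To compare speed on your own data, prefix each of the two commands with `time` after warming up the cache, as in the question.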