Does Unix grep work faster with long or short search terms

grepperformancespeedunix

Is it faster to look for long or short search terms? Or does it affect speed at all? In other words, should you make search terms as exact as possible?

There are more than 100 000 files and each file contains between 20 and more than 5000 rows of data. Usually the grep is used to find just one instance of the search term.

Let's say the search term is SEARCHTERM, and it will be in a row like this:

NAD+DP+1234567890:92++UNIQUE+NAME+SEARCHTERM++12345+FI'

Is it faster to look for "SEARCH" or "SEARCHTERM"? Let's say that in this case we don't care if we also find matches in other unrelated lines.

This is how I currently do it:

grep NAD+DP 123* | grep SEARCHTERM

But I find it quite slow, still. It usually takes around 3-5 minutes to find the data, even when I know the rough filename, which limits the range to around 10 000 files.

So, would a longer or shorter search term help? As far as I know, grep looks for "blocks" of words of a certain length?

Best Answer

Some reference material:

GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

from Why GNU grep is fast.

The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). [...] In general, the algorithm runs faster as the pattern length increases.

from Boyer–Moore string search algorithm.

Conclusion: Use longer strings.

Now, a bit of benchmark for fun:

# Initialisation
cd $(mktemp -d) && dd if=/dev/urandom of=random bs=1M count=1000
# Version
grep --v` # grep (GNU grep) 2.9
# Benchmark
(for s in 'short' 'this is not so short and we could even consider this as pretty long'; do for t in {1..10}; do time grep "$s" random; done; done ) 2> result

Results: 0.952s is the average for the short string, 0.244s is the average for the long string.

NB: The length is not the only criterion to be taken into account.

Related Question