Does Unix grep work faster with long or short search terms

grepperformancespeedunix

Is it faster to look for long or short search terms? Or does it affect speed at all? In other words, should you make search terms as exact as possible?

There are more than 100 000 files and each file contains between 20 and more than 5000 rows of data. Usually the grep is used to find just one instance of the search term.

Let's say the search term is SEARCHTERM, and it will be in a row like this:

NAD+DP+1234567890:92++UNIQUE+NAME+SEARCHTERM++12345+FI'

Is it faster to look for "SEARCH" or "SEARCHTERM"? Let's say that in this case we don't care if we also find matches in other unrelated lines.

This is how I currently do it:

grep NAD+DP 123* | grep SEARCHTERM

But I find it quite slow, still. It usually takes around 3-5 minutes to find the data, even when I know the rough filename, which limits the range to around 10 000 files.

So, would a longer or shorter search term help? As far as I know, grep looks for "blocks" of words of a certain length?

Best Answer

Some reference material:

GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

from Why GNU grep is fast.

The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). [...] In general, the algorithm runs faster as the pattern length increases.

from Boyer–Moore string search algorithm.

Conclusion: Use longer strings.

Now, a bit of benchmark for fun:

# Initialisation
cd $(mktemp -d) && dd if=/dev/urandom of=random bs=1M count=1000
# Version
grep --v` # grep (GNU grep) 2.9
# Benchmark
(for s in 'short' 'this is not so short and we could even consider this as pretty long'; do for t in {1..10}; do time grep "$s" random; done; done ) 2> result

Results: 0.952s is the average for the short string, 0.244s is the average for the long string.

NB: The length is not the only criterion to be taken into account.

Related Solutions

Search with “grep” for folder names

I usually use find:

$ find . -name 'FolderBasename*' -type d

or for more complex queries

$ find . -regex '{FolderRegex}' -type d

As pointed out in the comments if you want case insensitive searches do -iname and -iregex

What does grep do when using a dollar sign in the search term

The reason why every single line of every single file was listed was due to variable substitution of your shell.

When you call

grep -ir "$featuredicon" *

bash will evaluate that. It will look up the variable $featuredicon and put it into your command. Guess what $featuredicon most likely is?

Right, nothing. So what you're actually doing is:

grep -ir "" *

And that matches every line of every file.

Old Answer

The dollar sign ($) is a placeholder for end-of-line in regular expression (I assume the same is true for grep).

If you want to search for a dollar sign, use '\$'.

Be sure to check the StackOverflow question: How to grep for the dollar symbol ($)?

Best Answer

Related Solutions

Search with “grep” for folder names

What does grep do when using a dollar sign in the search term

Old Answer

Related Question