Why is egrep [wW][oO][rR][dD] faster than grep -i word

grep performance

I've been using grep -i more often lately, and I noticed that it is slower than the equivalent egrep command in which each letter is matched explicitly in upper or lower case:

$ time grep -iq "thats" testfile

real    0m0.041s
user    0m0.038s
sys     0m0.003s
$ time egrep -q "[tT][hH][aA][tT][sS]" testfile

real    0m0.010s
user    0m0.003s
sys     0m0.006s

Does grep -i do additional tests that egrep doesn't?

Best Answer

grep -i 'a' is equivalent to grep '[Aa]' in an ASCII-only locale. In a Unicode locale, character equivalences and conversions can be complex, so grep may have to do extra work to determine which characters are equivalent. The relevant locale setting is LC_CTYPE, which determines how bytes are interpreted as characters.
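As a quick check of that equivalence, here is a minimal sketch that counts matching lines with both forms under LC_ALL=C. The sample lines and the temporary file are made up for illustration; they are not the asker's data.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'That is one' 'THATS two' 'thats three' 'ThAtS four' 'nothing' > "$sample"

# Count matching lines with each form, forcing the C (ASCII-only) locale.
count_i=$(LC_ALL=C grep -ic 'thats' "$sample")
count_class=$(LC_ALL=C grep -c '[tT][hH][aA][tT][sS]' "$sample")
echo "grep -i: $count_i, bracket form: $count_class"

# Show the locale variables the current shell would otherwise pass to grep.
echo "LC_ALL=${LC_ALL:-unset} LC_CTYPE=${LC_CTYPE:-unset} LANG=${LANG:-unset}"
```

Both counts should agree here, because in the C locale case folding only involves the ASCII letters.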

In my experience, GNU grep can be slow when invoked in a UTF-8 locale. If you know that you're searching for ASCII characters only, invoking it in an ASCII-only locale may be faster. I expect that

time LC_ALL=C grep -iq "thats" testfile
time LC_ALL=C egrep -q "[tT][hH][aA][tT][sS]" testfile

would produce indistinguishable timings.

That being said, I can't reproduce your finding with GNU grep on Debian jessie (but you didn't specify your test file). If I set an ASCII locale (LC_ALL=C), grep -i is faster. The effect depends on the exact nature of the string: for example, a string with repeated characters reduces performance, which is to be expected.
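Since the result depends on the input, a comparison is easier to discuss when the test file is stated. The following is only a sketch under assumptions: the filler text, the line count and the repeat count are arbitrary choices, and run_many is a helper name invented here. It uses the POSIX times builtin inside a subshell instead of time, so that it also runs in shells where time is not a keyword.

```shell
# Hypothetical test data (an assumption, not the asker's file):
# 200000 filler lines, with the search word only on the last line.
testfile=$(mktemp)
awk 'BEGIN {
    for (i = 0; i < 200000; i++) print "the quick brown fox jumps over the lazy dog"
    print "Thats all, folks"
}' > "$testfile"

# Repeat a command 20 times so one-off startup noise matters less.
run_many() {
    n=0
    while [ "$n" -lt 20 ]; do
        "$@" > /dev/null
        n=$((n + 1))
    done
}

export LC_ALL=C

# Sanity check: both forms should select the same number of lines.
count_i=$(grep -ic 'thats' "$testfile")
count_class=$(grep -Ec '[tT][hH][aA][tT][sS]' "$testfile")
echo "matching lines: -i=$count_i, bracket form=$count_class"

# Each subshell starts with zeroed CPU counters, so the second line
# printed by `times` is the user and system time of its grep runs.
echo "grep -i:";      ( run_many grep -i 'thats' "$testfile"; times )
echo "bracket form:"; ( run_many grep -E '[tT][hH][aA][tT][sS]' "$testfile"; times )
```

Removing the export LC_ALL=C line and rerunning in a UTF-8 locale gives the other half of the comparison. The numbers will vary with the grep version and the machine.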