Optimizing GNU grep


I am using egrep (grep -E) with a pattern file (-f path/to/file).

This is done in an infinite loop on a stream of text, which means I cannot accumulate the input and pass it all to grep at once (as I could with *.log files).

Is there a way to make grep "save" the NFA it builds from the pattern file so it can reuse it on its next run?

I have searched Google and read the documentation with no luck.

Let me explain a little more. I need to locate a fixed set of strings with regexes (this is not part of the question, but feel free to suggest otherwise), such as IP addresses, domains, etc. The search is done on a feed from the internet; you can think of it as a stream of text.
I can't run grep on all of the input at once, since it's a stream.
I can accumulate a chunk of the stream and run grep on that (so I'm not running grep on each line), but the chunk size is also limited (say, to 30 seconds' worth).
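
Concretely, this is roughly what I'm doing now (a sketch; chunk.tmp and patterns are placeholder names, with the feed on standard input):

# Copy up to 30 seconds of the stream into a temporary file, then run
# grep once per chunk. timeout exits 124 when it kills cat, so any other
# status means the stream hit EOF. grep is re-executed (and the patterns
# recompiled) on every iteration, which is the cost I want to avoid.
# A line may also be split across two chunks.
while :; do
  timeout 30 cat > chunk.tmp
  status=$?
  grep -E -f patterns chunk.tmp
  [ "$status" -eq 124 ] || break
done > output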

I know grep builds an NFA from all of its patterns (in my case, from a file).
So my question is: can I tell grep to save that NFA for the next run, since it is not going to change? That would save the time of building it on every invocation.

Best Answer

No, there's no such thing. Generally, the cost of starting grep (forking a new process, loading the executable, shared libraries, dynamic linking...) is much greater than the cost of compiling the regexps, so this kind of optimisation would make little sense.
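
If you want to gauge how the two costs compare on your system, a rough check is to time grep on empty input with and without the pattern file (a sketch; patterns stands for your pattern file):

# Startup cost alone: a trivial pattern and no input to scan.
time grep -E -e x /dev/null

# Startup plus compiling the regexps from the pattern file.
time grep -E -f patterns /dev/null

The difference between the two runs approximates the pattern-compilation time.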

Though see Why is matching 1250 strings against 90k patterns so slow? about a bug in some versions of GNU grep that made it particularly slow for a large number of regexps.

Possibly, you could avoid running grep several times here by feeding your chunks to the same grep instance, for instance by using it as a co-process and using a marker to detect the end of each chunk. With zsh, GNU grep, and awk implementations other than mawk:

# Start grep once as a co-process; --line-buffered flushes matches
# promptly, and the extra -e pattern lets the marker line through.
coproc grep -E -f patterns -e '^@@MARKER@@$' --line-buffered
process_chunk() {
  # Feed the chunk plus a marker to grep in the background, then read
  # grep's output back until the marker reappears.
  { cat; echo @@MARKER@@; } >&p & awk '$0 == "@@MARKER@@"{exit};1' <&p
}
process_chunk < chunk1 > chunk1.grepped
process_chunk < chunk2 > chunk2.grepped

Though it may be simpler to do the whole thing with awk or perl instead.
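
For instance, here is a sketch of the same matching done entirely in awk, so the regexps are loaded once and one long-lived process keeps reading the stream (the patterns file name is illustrative, and note the inner loop makes this O(lines × patterns), which can lose to grep's combined automaton for large pattern sets):

# First pass stores one extended regexp per line of the patterns file;
# after that, print every line of standard input matching any of them.
awk 'NR == FNR {pat[++n] = $0; next}
     {for (i = 1; i <= n; i++) if ($0 ~ pat[i]) {print; break}}' patterns -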

But if you don't need the grep output to go into different files for different chunks, you can always do:

{
  cat chunk1
  while wget -qO- ...; do :; done # or whatever you use to fetch those chunks
  ...
} | grep -Ef patterns > output