I am using egrep (`grep -E`) with a PATTERN file (`-f path/to/file`).

This is done in an infinite loop on a stream of text, which means I cannot accumulate the input and pass ALL of it to grep at once (as with `*.log`).

Is there a way to make grep "save" the NFA it builds from the PATTERN file so it can reuse it on its next run?

I have searched Google and read the documentation with no luck.
I'll try to explain a little more. I need to locate a fixed number of strings with regexes (this is not part of the question, but feel free to suggest otherwise), such as IP addresses, domains, etc. The search is done on a feed from the internet, which you can think of as a stream of text.

I can't run `grep` on all of the input at once, since it's a stream. I can accumulate a chunk of the stream and run `grep` on that (thus not invoking `grep` for each line), but the chunk size is also limited (say, to 30 seconds' worth of data).

I know `grep` builds an NFA from all of its patterns (in my case, from a file). So my question is: can I tell `grep` to save that NFA for the next run, since the patterns are not going to change? That would save the time of building the NFA on every invocation.
Best Answer
No, there's no such thing. Generally, the cost of starting `grep` (forking a new process, loading the executable, shared libraries, dynamic linking...) is much greater than the cost of compiling the regexps, so this kind of optimisation would make little sense.

Though see Why is matching 1250 strings against 90k patterns so slow? about a bug in some versions of GNU `grep` that makes it particularly slow for a large number of regexps.

Possibly, here, you could avoid running `grep` several times by feeding your chunks to the same `grep` instance, for instance by using it as a co-process and using a marker line to detect the end of each chunk's output. With `zsh` and GNU `grep` and `awk` implementations other than `mawk`:
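A minimal sketch of that co-process idea, using bash's `coproc` rather than zsh; the pattern file, the `@@MARKER@@` sentinel, and the sample chunks are all illustrative names, not anything from the original:

```shell
#!/usr/bin/env bash
# Keep one grep alive as a co-process so its patterns are compiled only once.
workdir=$(mktemp -d)

# The sentinel pattern must be in the file so grep echoes the sentinel back;
# that echo is how we detect the end of each chunk's output.
printf '%s\n' 'foo' 'ba+r' '^@@MARKER@@$' > "$workdir/patterns.txt"

coproc GREP { grep -E --line-buffered -f "$workdir/patterns.txt"; }

search_chunk() {
  # Write the chunk plus the sentinel, then read matches up to the sentinel.
  printf '%s\n' "$1" '@@MARKER@@' >&"${GREP[1]}"
  while IFS= read -r line <&"${GREP[0]}"; do
    [ "$line" = '@@MARKER@@' ] && break
    printf '%s\n' "$line"
  done
}

search_chunk $'hello foo\nnothing here\nbaaar!' > "$workdir/out1.txt"
search_chunk 'no match at all'                  > "$workdir/out2.txt"

eval "exec ${GREP[1]}>&-"   # close grep's stdin so it exits cleanly
wait "$GREP_PID" 2>/dev/null
```

Note the sentinel must be a line that cannot occur in the real data, otherwise a chunk's output would be cut short.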
Though it may be simpler to do the whole thing in `awk` or `perl` instead.

But if you don't need the `grep` output to go into different files for different chunks, you can always do:
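Something along these lines, where `get_chunk` is a hypothetical stand-in for whatever command produces each chunk of the feed, and the finite loop stands in for the real infinite one:

```shell
#!/usr/bin/env bash
# Pipe every chunk through a single grep invocation so the patterns are
# compiled only once, for the lifetime of the whole stream.
d=$(mktemp -d)
printf '%s\n' 'foo' 'ba+r' > "$d/patterns.txt"

get_chunk() {
  # Illustrative: in reality this would read the next chunk from the feed.
  printf '%s\n' 'line with foo' 'irrelevant line'
}

# In the real setup this would be: while :; do get_chunk; done
for _ in 1 2 3; do
  get_chunk
done | grep -E --line-buffered -f "$d/patterns.txt" > "$d/matches.txt"
```

`--line-buffered` makes each match appear as soon as it is found, at some cost in throughput, which matters when consuming a live stream.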