Optimizing GNU grep


I am using egrep (grep -E) with a pattern file (-f path/to/file).

This is done in an infinite loop on a stream of text, which means I cannot accumulate the input and pass it all to grep at once (as I could with *.log files).

Is there a way to make grep "save" the NFA it builds from the pattern file so it can reuse it on its next run?

I have searched Google and read the documentation with no luck.

Let me explain a little more. I need to locate a fixed set of strings with regexes (this is not part of the question, but feel free to suggest otherwise), such as IP addresses, domains, etc. The search is done on a feed from the internet; you can think of it as a stream of text.
I can't run grep on all of the input at once, since it's a stream.
I can accumulate a chunk of the stream and run grep on that (so I'm not running grep on each line), but the chunk size is also limited (say, to 30 seconds' worth).
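
Concretely, this is roughly what I'm doing now (a sketch; chunk.tmp and patterns are placeholder names, with the feed on standard input):

# Copy up to 30 seconds of the stream into a temporary file, then run
# grep once per chunk. timeout exits 124 when it kills cat, so any other
# status means the stream hit EOF. grep is re-executed (and the patterns
# recompiled) on every iteration, which is the cost I want to avoid.
# A line may also be split across two chunks.
while :; do
  timeout 30 cat > chunk.tmp
  status=$?
  grep -E -f patterns chunk.tmp
  [ "$status" -eq 124 ] || break
done > output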

I know grep builds an NFA from all of its patterns (in my case, from a file).
So my question is: can I tell grep to save that NFA for the next run, since it is not going to change? That would save the time of building it on every invocation.

Best Answer

No, there's no such thing. Generally, the cost of starting grep (forking a new process, loading the executable, shared libraries, dynamic linking...) is much greater than the cost of compiling the regexps, so this kind of optimisation would make little sense.
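
If you want to gauge how the two costs compare on your system, a rough check is to time grep on empty input with and without the pattern file (a sketch; patterns stands for your pattern file):

# Startup cost alone: a trivial pattern and no input to scan.
time grep -E -e x /dev/null

# Startup plus compiling the regexps from the pattern file.
time grep -E -f patterns /dev/null

The difference between the two runs approximates the pattern-compilation time.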

Though see Why is matching 1250 strings against 90k patterns so slow? about a bug in some versions of GNU grep that made it particularly slow for a large number of regexps.

Possibly, you could avoid running grep several times here by feeding your chunks to the same grep instance, for instance by using it as a co-process and using a marker to detect the end of each chunk. With zsh, GNU grep, and awk implementations other than mawk:

# Start grep once as a co-process; --line-buffered flushes matches
# promptly, and the extra -e pattern lets the marker line through.
coproc grep -E -f patterns -e '^@@MARKER@@$' --line-buffered
process_chunk() {
  # Feed the chunk plus a marker to grep in the background, then read
  # grep's output back until the marker reappears.
  { cat; echo @@MARKER@@; } >&p & awk '$0 == "@@MARKER@@"{exit};1' <&p
}
process_chunk < chunk1 > chunk1.grepped
process_chunk < chunk2 > chunk2.grepped

Though it may be simpler to do the whole thing with awk or perl instead.
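
For instance, here is a sketch of the same matching done entirely in awk, so the regexps are loaded once and one long-lived process keeps reading the stream (the patterns file name is illustrative, and note the inner loop makes this O(lines × patterns), which can lose to grep's combined automaton for large pattern sets):

# First pass stores one extended regexp per line of the patterns file;
# after that, print every line of standard input matching any of them.
awk 'NR == FNR {pat[++n] = $0; next}
     {for (i = 1; i <= n; i++) if ($0 ~ pat[i]) {print; break}}' patterns -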

But if you don't need the grep output to go into different files for different chunks, you can always do:

{
  cat chunk1
  while wget -qO- ...; do :; done # or whatever you use to fetch those chunks
  ...
} | grep -Ef patterns > output