Grep Awk – How to Grep a Huge Number of Patterns from a Huge File

Tags: awk, database, grep, text

I have a file that grows by about 200,000 lines a day. It consists entirely of three-line blocks like this:

1358726575123       # key
    Joseph Muller   # name
    carpenter       # job
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist

Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. I then run a for loop over these patterns and check each one against the first file. If the file doesn't contain such a pattern, I save the pattern in a third file for further processing:

for number in $(grep -o '[0-9]\{12\}' file2); do  # finds about 10,000 keys
     if ! grep -q "^$number\$" file1; then        # file1 is a huge file
         printf '%s\n' "$number" >>file3          # we'll process file3 later
     fi
done

The example code greps a huge file 10,000 times, and I run this loop about once a minute, during the whole day.

Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help…

Best Answer

This answer is based on the awk answer posted by potong.
It is twice as fast as the comm method (on my system), for the same 6 million lines in main-file and 10 thousand keys... (now updated to use FNR,NR)

Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL.

awk '/^[^0-9]/     { next }             # Skip lines which do not hold key values
     FNR==NR       { main[$0]=1; next } # First file ("mainfile"): remember every key
     !($0 in main) { keys[$0]=1 }       # Second file ("keys"): keep keys missing from mainfile
     END { for (key in keys) print key }' \
       "mainfile" "keys" >"keys.not-in-main"

# For 6 million lines in "mainfile" and 10 thousand keys in "keys"

# The awk  method
# time:
#   real    0m14.495s
#   user    0m14.457s
#   sys     0m0.044s

# The comm  method
# time:
#   real    0m27.976s
#   user    0m28.046s
#   sys     0m0.104s
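
The comm method used in the timing comparison is not shown in this answer. A minimal sketch of what such an approach might look like follows, assuming the same three-line block layout; the tiny sample files and the intermediate names main.sorted and keys.sorted are illustrative assumptions, not the code that was actually timed:

```shell
#!/bin/sh
# Sketch only: sample data and intermediate file names are assumptions.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# "mainfile": blocks of three lines, key at the start of the line.
printf '%s\n' '1358726575123' '    Joseph Muller' '    carpenter' \
              '9973834728345' '    Andres Smith'  '    student' >mainfile
# "keys": candidate keys, one per line.
printf '%s\n' '9973834728345' '7836472098652' >keys

# comm needs both inputs sorted in the same collation order;
# LC_ALL=C keeps sort and comm consistent with each other.
LC_ALL=C grep '^[0-9]' mainfile | LC_ALL=C sort -u >main.sorted
LC_ALL=C sort -u keys >keys.sorted

# -2 drops lines found only in main.sorted, -3 drops lines common to both,
# so what remains are the keys that are absent from mainfile.
LC_ALL=C comm -23 keys.sorted main.sorted >keys.not-in-main
cat keys.not-in-main
```

Sorting the growing main file on every run is the likely reason this variant is slower than the awk hash lookup; keeping main.sorted up to date incrementally would change that trade-off.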

Related Solutions

Reading grep patterns from a file

The -f option specifies a file where grep reads patterns. That's just like passing patterns on the command line (with the -e option if there's more than one), except that when you're calling from a shell you may need to quote the pattern to protect special characters in it from being expanded by the shell.

The option -E, -F or -P, if any, tells grep which syntax the patterns are written in. With none of these options, grep expects basic regular expressions; with -E, grep expects extended regular expressions; with -P (if supported), grep expects Perl regular expressions; and with -F, grep expects literal strings. Whether the patterns come from the command line or from a file doesn't matter.
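
As a small illustration of how the syntax option changes the meaning of one and the same pattern file, here is a sketch using two throwaway files; the names patterns.txt and input.txt are assumptions made for this example:

```shell
#!/bin/sh
# Sketch only: file names and contents are illustrative assumptions.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf '%s\n' 'a+b' >patterns.txt          # one pattern per line
printf '%s\n' 'a+b' 'aab' >input.txt

bre=$(grep -f patterns.txt input.txt)      # basic regex: + is a literal character
ere=$(grep -E -f patterns.txt input.txt)   # extended regex: a+ means one or more a
fixed=$(grep -F -f patterns.txt input.txt) # fixed string: no character is special

printf 'BRE: %s\nERE: %s\nfixed: %s\n' "$bre" "$ere" "$fixed"
```

The basic-regex and fixed-string runs select the line a+b, while the extended-regex run selects aab instead.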

Note that the strings match as substrings: if you pass a+b as a pattern, then a line containing a+b+c is matched. If you want to search for lines consisting of exactly one of the supplied strings and nothing more, pass the -x option.
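
The difference matters for numeric keys like the ones in the question. The following sketch, with made-up file names main.keys and candidates, lists the candidate keys that are not present in a key list, first with substring matching and then with whole-line matching:

```shell
#!/bin/sh
# Sketch only: file names and sample keys are illustrative assumptions.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf '%s\n' '1358726575123' '9973834728345' >main.keys    # keys already known
printf '%s\n' '9973834728345' '99738347283450' '7836472098652' >candidates

# Substring matching: 99738347283450 contains 9973834728345,
# so it is wrongly treated as already known.
loose=$(grep -vFf main.keys candidates)

# Whole-line matching with -x: only identical lines count as known.
exact=$(grep -vxFf main.keys candidates)

printf 'without -x:\n%s\nwith -x:\n%s\n' "$loose" "$exact"
```

Without -x only 7836472098652 is reported as missing; with -x the longer key 99738347283450 is reported as well.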

Grep – Print Unmatched Patterns Using Grep with Patterns from File

You could use grep -o to print only the matching part and use the result as patterns for a second grep -v on the original patterns.txt file:

grep -oFf patterns.txt Strings.xml | grep -vFf - patterns.txt

Though in this particular case you could also use join + sort:

join -t\" -v1 -j2 -o 1.1 1.2 1.3 <(sort -t\" -k2 patterns.txt) <(sort -t\" -k2 Strings.xml)
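
The two-step grep pipeline above can be tried on a small sample. The contents of patterns.txt and Strings.xml below are invented for illustration and may not resemble the original poster's files:

```shell
#!/bin/sh
# Sketch only: the file contents are illustrative assumptions.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf '%s\n' 'app_name' 'hello_text' 'unused_label' >patterns.txt
printf '%s\n' '<string name="app_name">Demo</string>' \
              '<string name="hello_text">Hello</string>' >Strings.xml

# Step 1: -o prints only the matched pattern text, one match per line.
# Step 2: those matches become the pattern list (-f -) and -v removes
# them from patterns.txt, leaving the patterns that never matched.
unmatched=$(grep -oFf patterns.txt Strings.xml | grep -vFf - patterns.txt)
printf '%s\n' "$unmatched"
```

Because the second grep also matches substrings, a pattern that contains another, matched pattern would be dropped as well; adding -x to the second grep avoids that when the patterns are whole lines.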
