Grep Awk – How to Grep a Huge Number of Patterns from a Huge File

Tags: awk, database, grep, text

I have a file that grows by about 200,000 lines a day. It consists entirely of three-line blocks like this:

1358726575123       # key
    Joseph Muller   # name
    carpenter       # job
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist

Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. I then loop over these patterns and check each one against the first file. If the first file doesn't contain a pattern, I save that pattern in a third file for further processing:

for number in $(grep -o '[0-9]\{13\}' file2); do  # finds about 10,000 keys
     if ! grep -q "^$number\$" file1; then        # file1 is a huge file
         printf '%s\n' "$number" >>file3          # we'll process file3 later
     fi
done

The example code greps a huge file 10,000 times, and I run this loop about once a minute, all day long.

Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help…

Best Answer

This answer is based on the awk answer posted by potong.
On my system it is about twice as fast as the comm method for the same 6 million lines in mainfile and 10 thousand keys (now updated to use FNR and NR).

Although awk is faster than your current approach and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL.


awk '/^[^0-9]/     { next }                 # Skip lines which do not hold key values
     FNR==NR       { main[$0]=1; next }     # First file ("mainfile"): remember its keys
     !($0 in main) { keys[$0]=1 }           # Second file ("keys"): keep keys missing from main
     END { for (key in keys) print key }' \
       "mainfile" "keys" >"keys.not-in-main"
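
As a quick sanity check, the awk program above can be tried on the three sample blocks from the question. This is only a sketch: the scratch directory, the two made-up keys (1111111111111 and 2222222222222) and the variable name `missing` are assumptions for illustration, and the "keys" file is assumed to already hold one extracted key per line.

```shell
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)

cat >"$workdir/mainfile" <<'EOF'
1358726575123
    Joseph Muller
    carpenter
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist
EOF

# Two keys that exist in mainfile and two made-up keys that do not.
printf '%s\n' 1358726575123 1111111111111 7836472098652 2222222222222 \
    >"$workdir/keys"

# Same logic as the awk program above. The result is piped through sort
# because "for (key in keys)" visits keys in an unspecified order.
missing=$(awk '/^[^0-9]/     { next }
               FNR==NR       { main[$0]=1; next }
               !($0 in main) { keys[$0]=1 }
               END { for (key in keys) print key }' \
              "$workdir/mainfile" "$workdir/keys" | sort)

printf '%s\n' "$missing"
rm -r "$workdir"
```

Only the two made-up keys should be reported, since the other two appear as key lines in "mainfile". The whole of "mainfile" is read exactly once per run, instead of once per key as in the original loop.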

# For 6 million lines in "mainfile" and 10 thousand keys in "keys"

# The awk  method
# time:
#   real    0m14.495s
#   user    0m14.457s
#   sys     0m0.044s

# The comm  method
# time:
#   real    0m27.976s
#   user    0m28.046s
#   sys     0m0.104s
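
The comm method timed above is not shown in this answer. A minimal sketch of what such an approach typically looks like follows; it is an assumption about the general technique, not necessarily the exact commands that were benchmarked, and the file names, made-up keys and the variable `not_in_main` are hypothetical.

```shell
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)

cat >"$workdir/mainfile" <<'EOF'
1358726575123
    Joseph Muller
    carpenter
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist
EOF
printf '%s\n' 1358726575123 1111111111111 7836472098652 2222222222222 \
    >"$workdir/keys"

# comm needs both inputs sorted in the same collation order, hence LC_ALL=C.
# Keep only the key lines (those starting with a digit) from mainfile.
grep '^[0-9]' "$workdir/mainfile" | LC_ALL=C sort >"$workdir/main.sorted"
LC_ALL=C sort -u "$workdir/keys" >"$workdir/keys.sorted"

# -2 suppresses lines found only in main.sorted, -3 suppresses lines found
# in both files, leaving the keys that are missing from mainfile.
not_in_main=$(LC_ALL=C comm -23 "$workdir/keys.sorted" "$workdir/main.sorted")

printf '%s\n' "$not_in_main"
rm -r "$workdir"
```

Sorting all the key lines of a multi-million-line file on every run is the likely reason this variant is slower than the single awk pass, which only needs hash lookups.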
