Bash – Grepping over a huge file performance

algorithms, bash, grep, large files

I have FILE_A, which has over 300K lines, and FILE_B, which has over 30M lines.
I wrote a bash script that greps for each line of FILE_A in FILE_B and writes the matches to a new file.

The whole process takes more than 5 hours.

I'm looking for suggestions on how to improve the performance of my script.

I'm using grep -F -m 1 as the grep command.
FILE_A looks like this:

123456789
123455321

and FILE_B is like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in bash I have a while loop that reads the next line of FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write the matching line to result.txt.

while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt

Thanks a lot in advance for your help.

Best Answer

The key to performance is reading the huge file only once.

You can pass multiple patterns to grep by putting them on separate lines. This is usually done by telling grep to read patterns from a file:

grep -F -f 300KFile 30MFile

This outputs the matches in the order of the large file, and prints lines that match multiple patterns only once. Furthermore, this looks for patterns anywhere in the line; for example, if the pattern file contains 1234, then lines such as 123456,345678,2348962342 and 478912,1211138,1234 will match.
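The substring behaviour described above is easy to reproduce on a tiny sample. The following is a minimal sketch; the temporary file names and the sample rows are made up for illustration and are not part of the original question.

```shell
# Hypothetical sample data standing in for 300KFile and 30MFile.
tmpdir=$(mktemp -d)
printf '%s\n' 1234 > "$tmpdir/patterns"
printf '%s\n' \
    '123456,345678,2348962342,' \
    '478912,1211138,1234,' \
    '999,888,777,' > "$tmpdir/data"

# One pass over the data file with every pattern loaded at once.
# -F treats each pattern as a fixed string, so 1234 also matches inside 123456.
matches=$(grep -F -f "$tmpdir/patterns" "$tmpdir/data")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

Both of the first two rows are printed, in the order they appear in the data file, even though only the second one contains 1234 as a whole field.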

You can restrict matching to exact column values by preprocessing the patterns. For example, if the patterns do not contain any of the special characters ()?*+\|[]{}:

<300KFile sed -e 's/^/(^|,)/' -e 's/$/($|,)/' |
grep -E -f - 30MFile
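To see what the anchoring does, here is a small sketch on made-up data. It also tries an assumed alternative that is not part of the answer: keeping -F and adding -w (whole-word matching, supported by GNU and BSD grep). That alternative only behaves like a column match when the fields consist of word characters (letters, digits, underscore) and the separator does not, as with the digits-and-commas data in the question.

```shell
# Hypothetical sample data; file names are placeholders.
tmpdir=$(mktemp -d)
printf '%s\n' 1234 > "$tmpdir/patterns"
printf '%s\n' \
    '123456,345678,2348962342,' \
    '478912,1211138,1234,' > "$tmpdir/data"

# The answer's approach: each pattern becomes (^|,)1234($|,)
anchored=$(sed -e 's/^/(^|,)/' -e 's/$/($|,)/' < "$tmpdir/patterns" |
    grep -E -f - "$tmpdir/data")

# Assumed alternative: fixed strings restricted to whole words.
whole_word=$(grep -F -w -f "$tmpdir/patterns" "$tmpdir/data")

printf 'anchored:   %s\n' "$anchored"
printf 'whole-word: %s\n' "$whole_word"

rm -rf "$tmpdir"
```

On this sample both commands keep only the row where 1234 is a complete field. Which of the two is faster on 300K patterns depends on the grep implementation, so it is worth timing both on a slice of the real data.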

If retaining only the first match for each pattern is important, make a first pass to extract only the relevant lines as above, then do a second pass in awk or perl that tracks patterns that have already been seen.

<300KFile sed -e 's/^/(^|,)/' -e 's/$/($|,)/' |
grep -E -f - 30MFile |
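perl -l -F, -ane '
    BEGIN {
        # Load every pattern into a hash for fast lookups.
        open P, "300KFile" or die;
        %patterns = map {chomp; $_=>1} <P>;
        close P;
    }
    # -n (not -p) so that only the explicit print below produces output.
    # Print a line when one of its fields is a pattern not seen before,
    # then retire that pattern.
    foreach $c (@F) {
        if ($patterns{$c}) {
            print;
            delete $patterns{$c};
        }
    }
'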
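The answer mentions awk as an option for the second pass. Below is a sketch of how the same first-match bookkeeping might look in awk alone, without the grep prefilter. It is an assumed variant, not the answer's code: it relies on the patterns fitting in memory as array keys and on the pattern file having no stray whitespace around each value. The sample files are made up.

```shell
# Hypothetical sample data standing in for 300KFile and 30MFile.
tmpdir=$(mktemp -d)
printf '%s\n' 123456789 123455321 > "$tmpdir/patterns"
printf '%s\n' \
    '123456789,123456789,730025400149993,' \
    '999999999,999999999,730025400000000,' \
    '123455321,123455321,730025400126097,' \
    '123456789,123456789,730025400149994,' > "$tmpdir/data"

first_matches=$(awk -F, '
    # First file: remember every pattern as an array key.
    NR == FNR { want[$1]; next }
    # Second file: print a line once if any field is a pattern that is
    # still pending, and retire every pattern found on that line.
    {
        hit = 0
        for (i = 1; i <= NF; i++)
            if ($i in want) { hit = 1; delete want[$i] }
        if (hit) print
    }
' "$tmpdir/patterns" "$tmpdir/data")
printf '%s\n' "$first_matches"

rm -rf "$tmpdir"
```

The fourth sample row repeats 123456789 and is skipped because that pattern was already retired by the first row.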

Related Solutions

Bash – get output and return value of grep in single operation in bash

You can use

output=$(grep -c 'name' inputfile)

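The variable output will then contain the number of matching lines (0, 1, 2, and so on), and grep's exit status is still available in $? immediately after the assignment. You can then use an if statement to branch on either value.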
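A minimal sketch of capturing both values, assuming a made-up input file; the variable names are illustrative. The exit status is saved with `|| status=$?` so the snippet also behaves under `set -e`.

```shell
# Hypothetical input file.
tmpfile=$(mktemp)
printf '%s\n' 'name one' 'other' 'name two' > "$tmpfile"

# grep -c prints the number of matching lines; the exit status is 0 when
# at least one line matched and 1 when none did.
status=0
count=$(grep -c 'name' "$tmpfile") || status=$?

none_status=0
none_count=$(grep -c 'absent' "$tmpfile") || none_status=$?

printf 'count=%s status=%s\n' "$count" "$status"
printf 'count=%s status=%s\n' "$none_count" "$none_status"

rm -f "$tmpfile"
```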

Shell – How to (Memory Limited) > grep -F -f file_A file_B >> output.txt

Loop through chunks of file_A, sending them as stdin to the same grep statement; adjust 1000 to your available memory:

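nlines=$(wc -l < file_A)
chunk=1000
# Use <= so a final chunk that starts exactly on the last line is not skipped.
for ((i = 1; i <= nlines; i += chunk))
do
  # "addr,+N" is a GNU sed extension: print lines i through i+chunk-1.
  sed -n "$i,+$((chunk - 1))p" file_A | grep -F -f - file_B
done > output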
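The same chunking idea can be sketched with split instead of repeated sed calls, so that file_A is read only once while it is cut into pieces. This is an assumed variant, not the answer's code; the sample files, the chunk size of 2, and the chunk. prefix are made up for illustration.

```shell
# Hypothetical sample data standing in for file_A and file_B.
tmpdir=$(mktemp -d)
printf '%s\n' a1 b2 c3 d4 e5 > "$tmpdir/file_A"
printf '%s\n' 'x,a1' 'y,zz' 'z,e5' 'w,c3' > "$tmpdir/file_B"

chunk=2   # patterns per grep run; adjust to the available memory

# Cut the pattern file into pieces of at most $chunk lines each.
split -l "$chunk" "$tmpdir/file_A" "$tmpdir/chunk."

# Run one grep per piece and collect all matches in a single output file.
for part in "$tmpdir"/chunk.*; do
    grep -F -f "$part" "$tmpdir/file_B"
done > "$tmpdir/output"

result=$(cat "$tmpdir/output")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

As with the sed loop, the output is grouped by chunk rather than following the order of file_B, and a line that matches patterns from two different chunks appears twice.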
