I have a file that grows by about 200,000 lines a day, and it is made up of blocks of three lines, like this:
1358726575123 # key
Joseph Muller # name
carpenter # job
9973834728345
Andres Smith
student
7836472098652
Mariah Anthony
dentist
Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. Then I run a for loop with these patterns and check each one against the first file. If the first file doesn't contain the pattern, I save the pattern in a third file for further processing:
for number in $(grep -o '[0-9]\{12\}' file2); do  # finds about 10,000 keys
    if ! grep -q "^$number$" file1; then          # file1 is a huge file
        printf '%s\n' "$number" >> file3          # we'll process file3 later
    fi
done
The example code greps the huge file 10,000 times, and I run this loop about once a minute, all day long.
Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a database instead of plain text would help…
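(For reference, a sorted-keys comparison of the kind hinted at here can be done with comm; the answer below benchmarks against exactly this sort of approach. This is only a sketch, and the intermediate file names are made up:)

# Extract the candidate keys and the keys already present in file1,
# sort both lists, then keep only the lines unique to the candidates.
# comm -23 suppresses column 2 (lines only in main.sorted) and
# column 3 (lines in both), leaving the keys missing from file1.
grep -o '[0-9]\{12\}' file2 | sort -u > keys.sorted
grep -o '[0-9]\{12\}' file1 | sort -u > main.sorted
comm -23 keys.sorted main.sorted > file3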
Best Answer
This answer is based on the awk answer posted by potong. It is twice as fast as the comm method (on my system), for the same 6 million lines in the main file and 10 thousand keys... (now updated to use FNR,NR).
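Potong's script isn't reproduced here, but the core FNR==NR two-file awk technique it builds on looks roughly like the following sketch (not the exact code): load the ~10,000 keys into an array on the first pass, then stream the huge file once and cross off every key that appears.

grep -o '[0-9]\{12\}' file2 > keys.txt
awk 'FNR == NR { missing[$0]; next }       # keys.txt: remember each candidate key
     $1 in missing { delete missing[$1] }  # file1: key is present, drop it
     END { for (k in missing) print k }    # whatever is left never appeared
' keys.txt file1 > file3

Unlike the original loop, this reads file1 exactly once per run, no matter how many keys there are.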
Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL...
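To give a rough idea of the database route, here is a hypothetical SQLite setup (table and file names are made up): keep the keys in an indexed table, so each membership test is a B-tree lookup instead of a scan of the whole file.

# One-time setup: a table whose PRIMARY KEY gives us an index for free.
sqlite3 keys.db 'CREATE TABLE IF NOT EXISTS keys(k TEXT PRIMARY KEY);'

# Periodically load keys from file1 (here: all of them, for simplicity).
# INSERT OR IGNORE silently skips keys already stored, and wrapping the
# bulk load in one transaction keeps it fast.
{ echo 'BEGIN;'
  grep -o '[0-9]\{12\}' file1 |
    sed "s/.*/INSERT OR IGNORE INTO keys VALUES('&');/"
  echo 'COMMIT;'
} | sqlite3 keys.db

# Membership test for one key: prints 1 if present, nothing otherwise.
sqlite3 keys.db "SELECT 1 FROM keys WHERE k = '1358726575123';"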