Extracting lines by key from very large file

text processing

I have a 42M line text file. The first nine characters of each line are a numeric key. What is the most efficient way to extract only the lines whose key exists in another list of about 1.5M keys? Both the file and the list of keys are sorted.

Best Answer

Using awk should be efficient enough - it provides built-in associative arrays, where the key lookup time is at worst logarithmically proportional to the number of keys (of your lookup table, which is relatively small in your example).

For your input this would be:

42M * log2(1.5M) ≈ 42M * 20 ≈ 840M key comparisons

(where M means 10^6)

If your awk uses hash tables, every key lookup costs only a constant (amortized) amount of time.

An example of an efficient awk based solution (using the default field separator; note that ARGIND is a GNU awk extension):

$ awk 'ARGIND == 1 { a[$1] = 1; next } $1 in a' keys.dat largefile.dat
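Since the question states the key is the first nine characters of each line (not necessarily a whitespace-delimited field), a variant keyed on the fixed-width prefix may match the data more closely - a sketch, assuming keys.dat holds one nine-character key per line:

$ awk 'ARGIND == 1 { a[$1] = 1; next } substr($0, 1, 9) in a' keys.dat largefile.dat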

Since both inputs are sorted, you could write a merge-style script that is even more efficient (runtime scaling linearly with the combined input size, and without holding the key list in memory), but it would cost more time to program. See the sketch below.
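A minimal sketch of such a merge in awk, assuming the keys are fixed-width (so lexicographic order agrees with numeric order) and keys.dat holds one key per line; the key file name is passed in via -v:

$ awk -v keyfile=keys.dat '
    BEGIN {
        # read the first key from the sorted key list
        more = (getline k < keyfile) > 0
    }
    {
        key = substr($0, 1, 9)
        # advance through the key list until it catches up with the current line
        while (more && k < key)
            more = (getline k < keyfile) > 0
        # print the line if its key is present in the key list
        if (more && k == key)
            print
    }' largefile.dat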

Or you could use join, which expects sorted files as input - the restriction is that your keys need to be sorted lexicographically (alphabetically, not numerically) - and perhaps you have to tweak the output format. For example:

$ join -j1 keys.dat largefile.dat

Use -t to configure the field separator and -o to adjust the output format.
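For instance, a sketch assuming whitespace-separated fields with the key in field 1 of both files, and that the large file has at least three fields; the LC_ALL=C re-sort is only needed if the existing order is not the byte-wise order join expects:

$ LC_ALL=C sort -k1,1 keys.dat > keys.sorted
$ LC_ALL=C sort -k1,1 largefile.dat > largefile.sorted
$ join -j1 -o 2.1,2.2,2.3 keys.sorted largefile.sorted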

This should run in time linear in the input size.
