Using awk
should be efficient enough: it provides built-in associative arrays, where the key lookup time is at worst logarithmically proportional to the number of keys (of your lookup table, which is relatively small in your example).
For your input this would be:
42M * log2(1.5M) -> 42M * 20 key comparisons
(where M means 10^6)
In case your awk implements its arrays as hash tables (most do), every key lookup costs only a constant amount of time on average.
An example of an efficient awk based solution (using the default field separator):
$ awk 'ARGIND == 1 { a[$1] = 1; next } a[$1] { print $0 }' keys.dat largefile.dat
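Note that ARGIND is a GNU awk extension. A portable sketch of the same lookup-table idea (the sample data here is made up for illustration) uses FNR == NR to detect the first file:

```shell
# Hypothetical sample data: keys.dat holds one key per line,
# largefile.dat holds records whose first field is the key.
printf 'k1\nk3\n' > keys.dat
printf 'k1 aaa\nk2 bbb\nk3 ccc\n' > largefile.dat

# FNR == NR is true only while reading the first file, so the
# array is filled there; afterwards, print lines whose $1 is a key.
awk 'FNR == NR { a[$1] = 1; next } a[$1]' keys.dat largefile.dat
# prints:
# k1 aaa
# k3 ccc
```

This works in any POSIX awk, not just gawk.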
Since both inputs are sorted, you could write a script that processes them in a single pass, with a runtime scaling linearly with the combined input size. But it would take more time to program.
Or you could use join, which expects sorted files as input. The restriction is that your key needs to be lexically sorted, and perhaps you have to tweak the output format. For example:
$ join -j1 keys.dat largefile.dat
Use -t to configure the field separator and -o to adjust the output format.
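As a sketch of those options (with made-up tab-separated sample data; the filenames just mirror the ones above), one might join on the first field and pick specific output columns:

```shell
# Hypothetical sorted, tab-separated inputs.
printf 'k1\tred\nk2\tblue\n' > keys.dat
printf 'k1\t100\nk2\t200\nk3\t300\n' > largefile.dat

# -t sets the tab separator (printf is used to produce a literal tab),
# -j 1 joins on field 1 of both files, and -o selects file.field pairs
# for the output: the key, then field 2 of each file.
join -t "$(printf '\t')" -j 1 -o 1.1,1.2,2.2 keys.dat largefile.dat
```

Only keys present in both files appear in the result, so k3 is dropped here.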
This should run in time linear to the input size.
This will give you all lines whose 9th column matches MEF2:
awk -F"\t" '$9~/MEF2/' file > output
Assuming your file is always tab-delimited, this will work and you can rest safely. That's as close to a zero margin of error as you'll ever get.
If, however, you have tried importing into something like R (presumably using read.table("file", sep="\t")) and that didn't work, you might have some lines with different numbers of fields (see the end for how to check that). If so, and assuming you are always interested in the last field, you can use $(NF) in awk to print the last field, no matter how many fields there are:
awk -F"\t" '$(NF)~/MEF2/' file > output
If you still feel the need to check, you can simply extract all lines that match MEF2, irrespective of where the match is, then compare the results:
grep MEF2 file > output2
Once you have that, you can use wc -l to check whether the two files have the same number of lines. If they don't, find where they differ by running:
grep -vFf output output2
That command will print any lines in output2 that are not present in output. If there are any, they most probably have MEF2 somewhere in the line but not in the 9th field. If the match is in the 9th field, then your file is not tab-separated and there's something wrong with your data.
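Put together, the whole cross-check might look like this (a sketch with made-up data; file, output and output2 are the names used above):

```shell
# Hypothetical sample: line 2 has MEF2 in the 1st field, not the 9th.
printf 'x\tx\tx\tx\tx\tx\tx\tx\tMEF2_A\nMEF2_B\tx\tx\tx\tx\tx\tx\tx\ty\n' > file

awk -F"\t" '$9~/MEF2/' file > output   # matches in field 9 only
grep MEF2 file > output2               # matches anywhere in the line

wc -l output output2                   # counts differ: 1 vs 2 lines
grep -vFf output output2               # shows the stray MEF2_B line
```

Here the counts differ, and the last command pinpoints the line where MEF2 appears outside the 9th field.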
The awk above is probably the simplest solution, but here are a few others that do the same thing:
Perl
perl -F"\t" -lane '$F[8]=~/MEF2/ && print' file
sed
(this one might match wrong lines if you have more than 9 fields)
sed -n '/\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*/p' file
grep
grep -P '^.+?\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*' file
If these don't all produce the same output, you know there's an issue with your file. One more check is to make sure that all your lines have exactly 9 fields:
awk -F"\t" 'NF!=9' file
The above will print all lines that don't have exactly 9 tab-separated fields. If there is output, the lines it prints are problematic.
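To get an overview rather than just the offending lines, you can also tally how many lines have each field count (a small diagnostic sketch; the sample data is made up):

```shell
# Hypothetical tab-separated file where one line is missing a field.
printf 'a\tb\tc\na\tb\na\tb\tc\n' > file

# Histogram of field counts: a clean file reports a single NF value.
# Here it shows one line with 2 fields and two lines with 3.
awk -F"\t" '{ print NF }' file | sort | uniq -c
```

A single row in the histogram means every line has the same number of fields; multiple rows tell you exactly how widespread the problem is.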