I have a 42M line text file. Th first nine characters of each line are a numeric key. What is the most efficient way to extract only the lines whose key exists in another list of about 1.5M keys? Both the file and the list of keys are sorted.
Extracting lines by key from very large file
text processing
Related Solutions
This will give you all lines whose 9th column matches MEF2
:
awk -F"\t" '$9~/MEF2/' file > output
Assuming your file is always tab-delimited, this will work and you can rest safely. That's as close to 0 margin of error as you'll ever get.
If, however, you have tried importing into something like R (presumably using read.table("file",sep="\t")
) and that didn't work, you might have some lines with different numbers of fields (see the end for how to check that). If so, assuming you are always interested in the last field, you can use $(NF)
in awk
to print the last field, no matter how many fields there are:
awk -F"\t" '$(NF)~/MEF2/' file > output
If you still feel the need to check, you can simply extract all lines that match MEF2
, irrespective of where the match is, then compare the results:
grep MEF2 file > output2
Once you have that, you can use wc
to check if they have the same number of lines. If they don't, find where they differ by running
grep -vFf output output2
That command will print any lines in output2 that are not present in output1. If there are any, most probably they will have MEF2
somewhere in the line but not in the 9th field. If it's in the 9th field, then you know that your file is not tab separated and there's something wrong with your data.
The awk
above is probably the simplest solution but here are a few others that do the same thing:
Perl
perl -F"\t" -lane '$F[8]=~/MEF2/ && print' file
sed
(this one might match wrong lines if you have more than 9 fields)sed -n '/\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*/p' file
grep
grep -P '^.+?\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*' file
If these don't all produce the same output, you know there's an issue with your file. One more thing you can check is to make sure that all your lines have 9 fields. If they don't, you know there's an issue:
awk -F"\t" 'NF!=9' file
The above will print all lines that don't have exactly 9 tab-separated fields. If there is output, the lines it prints are problematic.
Since all input files are already sorted, we may bypass the actual sorting step and just use sort -m
for merging the files together.
On some Unix systems (to my knowledge only Linux), it may be enough to do
sort -m *.words | uniq -d >dupes.txt
to get the duplicated lines written to the file dupes.txt
.
To find what files these lines came from, you may then do
grep -Fx -f dupes.txt *.words
This will instruct grep
to treat the lines in dupes.txt
(-f dupes.txt
) as fixed string patterns (-F
). grep
will also require that the whole line matches perfectly from start to finish (-x
). It will print the file name and the line to the terminal.
Non-Linux Unices (or even more files)
On some Unix systems, 30000 file names will expand to a string that is too long to pass to a single utility (meaning sort -m *.words
will fail with Argument list too long
, which it does on my OpenBSD system). Even Linux will complain about this if the number of files are much larger.
Finding the dupes
This means that in the general case (this will also work with many more than just 30000 files), one has to "chunk" the sorting:
rm -f tmpfile
find . -type f -name '*.words' -print0 |
xargs -0 sh -c '
if [ -f tmpfile ]; then
sort -o tmpfile -m tmpfile "$@"
else
sort -o tmpfile -m "$@"
fi' sh
Alternatively, creating tmpfile
without xargs
:
rm -f tmpfile
find . -type f -name '*.words' -exec sh -c '
if [ -f tmpfile ]; then
sort -o tmpfile -m tmpfile "$@"
else
sort -o tmpfile -m "$@"
fi' sh {} +
This will find all files in the current directory (or below) whose names matches *.words
. For an appropriately sized chunk of these names at a time, the size of which is determined by xargs
/find
, it merges them together into the sorted tmpfile
file. If tmpfile
already exists (for all but the first chunk), this file is also merged with the other files in the current chunk. Depending on the length of your filenames, and the maximum allowed length of a command line, this may require more or much more than 10 individual runs of the internal script (find
/xargs
will do this automatically).
The "internal" sh
script,
if [ -f tmpfile ]; then
sort -o tmpfile -m tmpfile "$@"
else
sort -o tmpfile -m "$@"
fi
uses sort -o tmpfile
to output to tmpfile
(this won't overwrite tmpfile
even if this is also an input to sort
) and -m
for doing the merge. In both branches, "$@"
will expand to a list of individually quoted filenames passed to the script from find
or xargs
.
Then, just run uniq -d
on tmpfile
to get all line that are duplicated:
uniq -d tmpfile >dupes.txt
If you like the "DRY" principle ("Don't Repeat Yourself"), you may write the internal script as
if [ -f tmpfile ]; then
t=tmpfile
else
t=/dev/null
fi
sort -o tmpfile -m "$t" "$@"
or
t=tmpfile
[ ! -f "$t" ] && t=/dev/null
sort -o tmpfile -m "$t" "$@"
Where did they come from?
For the same reasons as above, we can't use grep -Fx -f dupes.txt *.words
to find where these duplications came from, so instead we use find
again:
find . -type f -name '*.words' \
-exec grep -Fx -f dupes.txt {} +
Since there is no "complicated" processing to be done, we may invoke grep
directly from -exec
. The -exec
option takes a utility command and will place the found names in {}
. With +
at the end, find
will place as many arguments in place of {}
as the current shell supports in each invocation of the utility.
To be totally correct, one may want to use either
find . -type f -name '*.words' \
-exec grep -H -Fx -f dupes.txt {} +
or
find . -type f -name '*.words' \
-exec grep -Fx -f dupes.txt /dev/null {} +
to be sure that filenames are always included in the output from grep
.
The first variation uses grep -H
to always output matching filenames. The last variation uses the fact that grep
will include the name of the matching file if more than one file is given on the command line.
This matters since the last chunk of filenames sent to grep
from find
may actually only contain a single filename, in which case grep
would not mention it in its results.
Bonus material:
Dissecting the find
+xargs
+sh
command:
find . -type f -name '*.words' -print0 |
xargs -0 sh -c '
if [ -f tmpfile ]; then
sort -o tmpfile -m tmpfile "$@"
else
sort -o tmpfile -m "$@"
fi' sh
find . -type f -name '*.words'
will simply generate a list of pathnames from the current directory (or below) where each pathnames is that of a regular file (-type f
) and that has a filename component at the end that matches *.words
. If only the current directory is to be searched, one may add -maxdepth 1
after the .
, before -type f
.
-print0
will ensure that all found pathnames are outputted with a \0
(nul
) character as delimiter. This is a character that is not valid in a Unix path and it enables us to process pathnames even if they contain newline characters (or other weird things).
find
pipes its output to xargs
.
xargs -0
will read the \0
-delimited list of pathnames and will execute the given utility repeatedly with chunks of these, ensuring that the utility is executed with just enough arguments to not cause the shell to complain about a too long argument list, until there is no more input from find
.
The utility invoked by xargs
is sh
with a script given on the command line as a string using its -c
flag.
When invoking sh -c '...some script...'
with arguments following, the arguments will be available to the script in $@
, except for the first argument, which will be placed in $0
(this is the "command name" that you may spot in e.g. top
if you are quick enough). This is why we insert the string sh
as the first argument after the end of the actual script. The string sh
is a dummy argument and could be any single word (some seem to prefer _
or sh-find
).
Best Answer
Using
awk
should be efficient enough - it provides builtin associative arrays, where the key lookup time is logarithmically proportional to the number of keys (of your lookup table - which is relatively small in your example).For your input this would be:
(where M means 10^6)
In case your awk uses hash tables, every key lookup would only cost a constant amount of time.
An example of an efficient awk based solution (using the default field separator):
Since both inputs are sorted you could write a script that would be more efficient (with a runtime scaling linearly with both input file sizes). But it would cost more time programming it.
Or you could use
join
which expect sorted files as input - restriction is that your key needs to be alphabetically sorted - and perhaps you have to tweak the output format. For example:Use
-t
to configure the field separator and-o
to adjust the output format.This should run in time linear to the input size.