This answer is based on the awk answer posted by potong. It is twice as fast as the comm method (on my system), for the same 6 million lines in the main file and 10 thousand keys... (now updated to use FNR and NR)
Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL.
awk '{ if (/^[^0-9]/) { next }               # Skip lines which do not hold key values
       if (FNR==NR) { main[$0]=1 }           # Process keys from file "mainfile"
       else if (!($0 in main)) { keys[$0]=1 } # Process keys from file "keys"
     } END { for (key in keys) print key }' \
   "mainfile" "keys" >"keys.not-in-main"
# For 6 million lines in "mainfile" and 10 thousand keys in "keys"
# The awk method
# time:
# real 0m14.495s
# user 0m14.457s
# sys 0m0.044s
# The comm method
# time:
# real 0m27.976s
# user 0m28.046s
# sys 0m0.104s
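For comparison, the comm method benchmarked above is not shown in the original; a minimal sketch of it might look like this (assuming both files are plain one-key-per-line lists; comm requires sorted input, and the process substitutions need bash):

```shell
#!/usr/bin/env bash
# Sample stand-ins for the 6-million-line "mainfile" and the "keys" file
printf 'key1\nkey2\nkey3\n' > mainfile
printf 'key2\nkey9\n'       > keys
# comm -13: -1 drops lines unique to mainfile, -3 drops lines common
# to both, leaving only keys that are NOT in mainfile
comm -13 <(sort mainfile) <(sort keys) > keys.not-in-main
cat keys.not-in-main    # -> key9
```

The sort calls are what make this slower than the awk hash-lookup approach on large inputs.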
Yes, find ./work -print0 | xargs -0 rm will execute something like rm ./work/a './work/b c' .... You can check with echo: find ./work -print0 | xargs -0 echo rm will print the command that will be executed (except that whitespace will be escaped appropriately, though the echo won't show that).
To get xargs to put the names in the middle, you need to add -I[string], where [string] is what you want to be replaced with the argument; in this case you'd use -I{}, e.g. <strings.txt xargs -I{} grep {} directory/*.
What you actually want to use is grep -F -f strings.txt:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
So grep -Ff strings.txt subdirectory/* will find all occurrences of any string in strings.txt as a literal; if you drop the -F option, you can use regular expressions in the file. You could actually use grep -F "$(<strings.txt)" directory/* too. If you want to practice find, you can use the last two examples in the summary. If you want to do a recursive search instead of just the first level, you have a few options, also in the summary.
Summary:
# grep for each string individually.
<strings.txt xargs -I{} grep {} directory/*
# grep once for everything
grep -Ff strings.txt subdirectory/*
grep -F "$(<strings.txt)" directory/*
# Same, using find
find subdirectory -maxdepth 1 -type f -exec grep -Ff strings.txt {} +
find subdirectory -maxdepth 1 -type f -print0 | xargs -0 grep -Ff strings.txt
# Recursively
grep -rFf strings.txt subdirectory
find subdirectory -type f -exec grep -Ff strings.txt {} +
find subdirectory -type f -print0 | xargs -0 grep -Ff strings.txt
You may want to use the -l option to get just the name of each matching file if you don't need to see the actual line:
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
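A minimal sketch of -l combined with -Ff (the file names strings.txt, match.txt, and nomatch.txt are made up for the demo):

```shell
# Sample data: one file that contains a listed string, one that doesn't
printf 'needle\n'          > strings.txt
printf 'hay needle hay\n'  > match.txt
printf 'just hay\n'        > nomatch.txt
# -l prints only the names of files containing a match, not the lines
grep -lFf strings.txt match.txt nomatch.txt   # -> match.txt
```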
You could use grep -o to print only the matching part and use the result as patterns for a second grep -v on the original patterns.txt file. Though in this particular case you could also use join + sort.
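The code for those two approaches is not shown above; a sketch of how they might look, assuming patterns.txt holds one fixed string per line and input.txt is the file being searched (the goal being to list the patterns that never matched):

```shell
#!/usr/bin/env bash
# Sample data (file names and contents are made up for the demo)
printf 'alpha\nbeta\ngamma\n'                 > patterns.txt
printf 'some alpha text\nmore gamma text\n'   > input.txt
# grep -o prints each matching pattern occurrence; sort -u dedups;
# grep -vxFf - then keeps only the patterns that never matched
grep -oFf patterns.txt input.txt | sort -u | grep -vxFf - patterns.txt
# -> beta
# The join + sort variant: join -v1 keeps lines of the first (sorted)
# file that have no counterpart in the second
join -v1 <(sort patterns.txt) <(grep -oFf patterns.txt input.txt | sort -u)
# -> beta
```

Both print the patterns with no match; the grep pipeline reads patterns from stdin via -f -, which is a GNU grep feature.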