How to grep thousands of files in a directory for hundreds of strings in a file

grep

I am trying to compose a grep statement and it is killing me. I am also tired of getting the arguments list too long error. I have a file, let's call it subset.txt. It contains hundreds of lines with specific strings such as MO43312948. In my object directory I have thousands of files and I need to copy all the files that contain the strings listed in subset.txt into another directory.

I was trying to start with this to just return the matching files from the objects directory.

grep -F "$(subset.txt)" /objects/*

I keep getting `bash: /bin/grep: Argument list too long“

Best Answer

You can pass a directory as a target to grep with -R and a file of input patterns with -f:

  -f FILE, --file=FILE
          Obtain patterns from FILE, one per line.  If this option is used
          multiple  times  or  is  combined with the -e (--regexp) option,
          search for all patterns given.  The  empty  file  contains  zero
          patterns, and therefore matches nothing.

   -R, --dereference-recursive
          Read all files under each directory,  recursively.   Follow  all
          symbolic links, unlike -r.

So, you're looking for:

grep -Ff subset.txt -r objects/

You can get the list of matching files with:

grep -Flf subset.txt -r objects/

So, if your final list isn't too long, you can just do:

 mv $(grep -Flf subset.txt -r objects/) new_dir/

If that returns an argument list too long error, use:

grep -Flf subset.txt -r objects/ | xargs -I{} mv {} bar/

And if your file names can contain spaces or other strange characters, use (assuming GNU grep):

grep -FZlf subset.txt -r objects/ | xargs -0I{} mv {} bar/

Finally, if you want to exclude binary files, use:

grep -IFZlf subset.txt -r objects/ | xargs -0I{} mv {} bar/

Related Solutions

Grep a directory and return list with line numbers

Many grep variants implement a recursive option. E.g., GNU grep

-R, -r, --recursive
          Read all files under each directory, recursively; this is equivalent to the -d recurse option.

You can then remove find:

grep -n -r $pattern $path | awk '{ print $1 }'

but this keeps more than the line number. awk is printing the first column. This example

src/main/package/A.java:3:import java.util.Map;
src/main/package/A.java:5:import javax.security.auth.Subject;
src/main/package/A.java:6:import javax.security.auth.callback.CallbackHandler;

will be printed as

src/main/package/A.java:3:import
src/main/package/A.java:5:import
src/main/package/A.java:6:import

notice the :import in each line. You might want to use sed to filter the output.

Since a : could be present in the file name you can use the -Z option of grep to output a nul character (\0) after the file name.

grep -rZn $pattern $path | sed -e "s/[[:cntrl:]]\([0-9][0-9]*\).*/:\1/"

with the same example as before will produce

src/main/package/A.java:3
src/main/package/A.java:5
src/main/package/A.java:6

Grep Performance – Search in Thousands of Files Efficiently

With find:

cd /the/dir
find . -type f -exec grep pattern {} +

(-type f is to only search in regular files (also excluding symlinks even if they point to regular files). If you want to search in any type of file except directories (but beware there are some types of files like fifos or /dev/zero that you generally don't want to read), replace -type f with the GNU-specific ! -xtype d (-xtype d matches for files of type directory after symlink resolution)).

With GNU grep:

grep -r pattern /the/dir

(but beware that unless you have a recent version of GNU grep, that will follow symlinks when descending into directories). Non-regular files won't be searched unless you add a -D read option. Recent versions of GNU grep will still not search inside symlinks though.

Very old versions of GNU find did not support the standard {} + syntax, but there you could use the non-standard:

cd /the/dir &&
  find . -type f -print0 | xargs -r0 grep pattern

Performances are likely to be I/O bound. That is the time to do the search would be the time needed to read all that data from storage.

If the data is on a redundant disk array, reading several files at a time might improve performance (and could degrade them otherwise). If the performances are not I/O bound (because for instance all the data is in cache), and you have multiple CPUs, concurrent greps might help as well. You can do that with GNU xargs's -P option.

For instance, if the data is on a RAID1 array with 3 drives, or if the data is in cache and you have 3 CPUs whose time to spare:

cd /the/dir &&
  find . -type f -print0 | xargs -n1000 -r0P3 grep pattern

(here using -n1000 to spawn a new grep every 1000 files, up to 3 running in parallel at a time).

However note that if the output of grep is redirected, you'll end up with badly interleaved output from the 3 grep processes, in which case you may want to run it as:

find . -type f -print0 | stdbuf -oL xargs -n1000 -r0P3 grep pattern

(on a recent GNU or FreeBSD system) or use the --line-buffered option of GNU grep.

If pattern is a fixed string, adding the -F option could improve matters.

If it's not multi-byte character data, or if for the matching of that pattern, it doesn't matter whether the data is multi-byte character or not, then:

cd /the/dir &&
  LC_ALL=C grep -r pattern .

could improve performance significantly.

If you end up doing such searches often, then you may want to index your data using one of the many search engines out there.

Best Answer

Related Solutions

Grep a directory and return list with line numbers

Grep Performance – Search in Thousands of Files Efficiently

Related Question