Yes, `find ./work -print0 | xargs -0 rm` will execute something like `rm ./work/a "./work/b c" ...`. You can check with `echo`: `find ./work -print0 | xargs -0 echo rm` will print the command that will be executed (except whitespace will be handled appropriately, though the `echo` output won't show that).
To get `xargs` to put the names in the middle, you need to add `-I[string]`, where `[string]` is what you want to be replaced with the argument; in this case you'd use `-I{}`, e.g. `<strings.txt xargs -I{} grep {} directory/*`.
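A quick self-contained illustration of the `-I` placement (the input words are made up for the demo):

```shell
# Each input line replaces {} wherever it appears in the template,
# so the argument can sit in the middle of the command.
printf 'alpha\nbeta\n' | xargs -I{} echo before {} after
# before alpha after
# before beta after
```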
What you actually want to use is `grep -F -f strings.txt`:
```
-F, --fixed-strings
       Interpret PATTERN as a list of fixed strings, separated by
       newlines, any of which is to be matched. (-F is specified by
       POSIX.)
-f FILE, --file=FILE
       Obtain patterns from FILE, one per line. The empty file
       contains zero patterns, and therefore matches nothing. (-f is
       specified by POSIX.)
```
So `grep -Ff strings.txt subdirectory/*` will find all occurrences of any string in `strings.txt` as a literal; if you drop the `-F` option you can use regular expressions in the file. You could actually use `grep -F "$(<strings.txt)" directory/*` too. If you want to practice `find`, you can use the last two examples in the summary. If you want to do a recursive search instead of just the first level, you have a few options, also in the summary.
Summary:

```
# grep for each string individually.
<strings.txt xargs -I{} grep {} directory/*

# grep once for everything
grep -Ff strings.txt subdirectory/*
grep -F "$(<strings.txt)" directory/*

# Same, using find
find subdirectory -maxdepth 1 -type f -exec grep -Ff strings.txt {} +
find subdirectory -maxdepth 1 -type f -print0 | xargs -0 grep -Ff strings.txt

# Recursively
grep -rFf strings.txt subdirectory
find subdirectory -type f -exec grep -Ff strings.txt {} +
find subdirectory -type f -print0 | xargs -0 grep -Ff strings.txt
```
You may want to use the `-l` option to get just the name of each matching file if you don't need to see the actual line:
```
-l, --files-with-matches
       Suppress normal output; instead print the name of each input
       file from which output would normally have been printed. The
       scanning will stop on the first match. (-l is specified by
       POSIX.)
```
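A minimal sketch of the difference (the temporary files here are hypothetical, created just for the demo):

```shell
# Create two throwaway files; only one contains the pattern.
tmp=$(mktemp -d)
printf 'needle in here\n' > "$tmp/match.txt"
printf 'nothing here\n'   > "$tmp/other.txt"

# Without -l, grep prints the matching lines; with -l it prints
# only the name of each file that contains a match.
grep -l needle "$tmp"/*    # prints only .../match.txt
rm -r "$tmp"
```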
That's pretty much the most common way of finding the "N most common things", except you're missing a `sort`, and you've got a gratuitous `cat`:

```
tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10
```
If you don't put a `sort` before the `uniq -c` you'll probably get a lot of false singleton words. `uniq` only collapses consecutive runs of identical lines, not overall uniqueness.
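To see why, a tiny made-up example:

```shell
# uniq -c only collapses adjacent duplicates, so unsorted input
# gives misleading per-run counts:
printf 'cat\ndog\ncat\n' | uniq -c
#   1 cat
#   1 dog
#   1 cat

# Sorting first groups identical lines together, giving true counts:
printf 'cat\ndog\ncat\n' | sort | uniq -c
#   2 cat
#   1 dog
```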
EDIT: I forgot a trick: "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", and "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution includes a file named `eign` which contains a pretty decent list of stop words. My Arch distro has `/usr/share/groff/current/eign`, but I think I've also seen `/usr/share/dict/eign` or `/usr/dict/eign` in old Unixes.
You can use stop words like this:

```
tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head -10
```
My guess is that most human languages need similar "stop words" removed from meaningful word-frequency counts, but I don't know where to suggest getting stop-word lists for other languages.
EDIT: the `fgrep` should use the `-w` option, which enables whole-word matching. This avoids false positives on words that merely contain short stop words, like "a" or "i".
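A small illustration with made-up input, filtering on the single stop word "a":

```shell
# Without -w, "a" matches any word that merely contains the letter,
# so real words are wrongly filtered out:
printf 'cat\nand\ndog\n' | grep -F -v a
# dog

# With -w, only the whole word "a" would be filtered, so all survive:
printf 'cat\nand\ndog\n' | grep -F -v -w a
# cat
# and
# dog
```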
Best Answer
Edit: If you have GNU utilities, see Gilles' answer for a method using GNU `grep`'s recursion abilities that is much simpler than the `find` approach. If you only want to display filenames, you'll still want to add the `-l` option as I describe below.

Use `grep -l word` to only print the names of files containing a match.

If you want to find all files in the file system ending in `.sh`, starting at the root `/`, then `find` is the most appropriate tool. The most portable and efficient recommendation is:

`find / -type f -name '*.sh' -exec grep -l word {} + 2>/dev/null`

This is about as readable as it gets, and is not hard to parse if you understand the semantics behind each of the components.
- `find /`: run `find` starting at the file system root, `/`.
- `-type f`: only match regular files.
- `-name '*.sh'`: ... and only match files whose names end in `.sh`.
- `-exec ... {} +`: run the command specified in `...` on the matched files in groups, where `{}` is replaced by the file names in the group. The idea is to run the command on as many files at once as possible, within the limits of the system (`ARG_MAX`). The efficiency of the `{} +` form comes from minimizing the number of times the `...` command must be called, by maximizing the number of files passed to each invocation.
- `grep -l word {}`: where `{}` is the same `{}` repeated from above and is replaced by the file names. As previously explained, `grep -l` prints the names of files containing a match for `word`.
- `2>/dev/null`: hide error messages (technically, redirect standard error to the black hole that is `/dev/null`). This is for aesthetic and practical reasons, since running `find` on `/` will likely produce reams of "permission denied" messages for files you do not have permission to read and directories you do not have permission to traverse.

There are some problems with the suggestions you received and posted in your question. Both fail on files with whitespace in their names. It's best to avoid putting filenames in command substitution altogether. The first one has the additional problem of potentially running into the `ARG_MAX` limit. The second one is close to what I suggest, but there is no good reason to use `xargs` here, not to mention that safe and correct usage of `xargs` requires sacrificing portability for some GNU-only options (`find -print0 | xargs -0`).
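To illustrate, a sketch on a hypothetical throwaway directory showing that `-exec ... {} +` copes with whitespace in names:

```shell
# find passes each filename as a separate argument to grep, so a
# name containing spaces is handled safely, with no quoting tricks.
tmp=$(mktemp -d)
printf 'word here\n' > "$tmp/file with spaces.sh"
printf 'no match\n'  > "$tmp/other.sh"
find "$tmp" -type f -name '*.sh' -exec grep -l word {} + 2>/dev/null
# prints only ".../file with spaces.sh"
rm -r "$tmp"
```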