Grep Find – Search for Text Files Where Two Different Words Exist

find grep search

I'm looking for a way to search for files in which two given words both appear. I've been using the following to perform my searches up to this point:

find . -exec grep -l "FIND ME" {} \;

The problem I'm running into is that if there isn't exactly one space between "FIND" and "ME", the search does not return the file. How do I adapt the search above so that it matches files where both words "FIND" and "ME" exist anywhere in the file, as opposed to the exact string "FIND ME"?

I'm using AIX.

Best Answer

With GNU tools:

find . -type f -exec grep -lZ FIND {} + | xargs -r0 grep -l ME

With standard (POSIX) tools, you can do:

find . -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;

But that would run up to two greps per file. To avoid running that many greps and still be portable while still allowing any character in file names, you could do:

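convert_to_xargs() {
  # sed: put a backslash before each blank, quote and backslash.
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        # Print the previous line; if the current line does not start
        # a new path (no "//"), escape the newline that separates them.
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}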

export LC_ALL=C
find .//. -type f |
  convert_to_xargs |
  xargs grep -l FIND |
  convert_to_xargs |
  xargs grep -l ME

The idea is to convert the output of find into a format suitable for xargs, which expects a blank-separated list of words (blank meaning SPC/TAB/NL in the C locale, YMMV in other locales) in which single quotes, double quotes and backslashes can escape blanks and each other.
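To make that input format concrete, here is a small sketch of how xargs (without -0) splits its input. show_args is a hypothetical helper introduced only for this illustration; it prints each argument it receives in brackets.

```shell
# show_args (hypothetical helper): run xargs on stdin and print every
# resulting argument on its own line, wrapped in brackets.
show_args() {
  xargs sh -c 'for arg do printf "[%s]\n" "$arg"; done' sh
}

# An unescaped blank separates two arguments:
printf '%s\n' 'two words' | show_args
# [two]
# [words]

# A backslash, single quotes or double quotes keep the blank inside
# a single argument:
printf '%s\n' 'one\ word' "'also one'" '"and another"' | show_args
# [one word]
# [also one]
# [and another]
```

This is why every blank, quote and backslash in a file name has to be escaped before the list is handed to xargs.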

Generally you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that are found in file names. For instance if we see:

./a
./b

We've got no way to know whether that is one file called "b" in a directory called "a<NL>." or the two files "a" and "b" in the current directory.
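The ambiguity can be reproduced with a short sketch. It assumes mktemp -d is available (it is not required by POSIX), and the directory names two_files and one_file, as well as the helper list_files, are made up for the illustration.

```shell
nl='
'
tmp=$(mktemp -d)              # assumption: mktemp -d exists on this system
trap 'rm -rf "$tmp"' EXIT     # remove the scratch directory when the shell exits

# Case 1: two files, a and b, directly in the directory.
mkdir -p "$tmp/two_files"
touch "$tmp/two_files/a" "$tmp/two_files/b"

# Case 2: one file, b, inside a directory whose name is "a<NL>.".
mkdir -p "$tmp/one_file/a${nl}."
touch "$tmp/one_file/a${nl}./b"

# list_files (hypothetical helper): list regular files relative to "$1".
# sort only makes the order of the two files in case 1 predictable.
list_files() (
  cd "$1" && find . -type f | sort
)

list_files "$tmp/two_files"
list_files "$tmp/one_file"
# Both calls print the same two lines:
# ./a
# ./b
```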

By using .//., because // cannot appear otherwise in a file path as output by find (because there's no such thing as a directory with an empty name and / is not allowed in a file name), we know that if we see a line that contains //, then that's the first line of a new filename. So we can use that awk command to escape all newline characters but those that precede those lines.

If we take the example above, find .//. (which prefixes every path with its starting point, .//.) would output in the first case (one file):

.//./a
./b

Which awk escapes to:

.//./a\
./b

So that xargs sees it as one argument. And in the second case (two files):

.//./a
.//./b

Which awk would leave as is, so xargs sees two arguments.
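As a rough end-to-end check, the sketch below builds a scratch directory containing a file under a directory named "a<NL>." and a file with a space in its name, then runs the whole pipeline. It assumes mktemp -d is available; the file names and contents are invented for the test, and convert_to_xargs is repeated so the snippet runs on its own.

```shell
convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

nl='
'
tmp=$(mktemp -d)              # assumption: mktemp -d exists on this system
trap 'rm -rf "$tmp"' EXIT     # remove the scratch directory when the shell exits

# One file containing both words, under a directory named "a<NL>.".
mkdir -p "$tmp/a${nl}."
printf 'FIND\nother text\nME\n' > "$tmp/a${nl}./b"
# One file containing only the first word, with a space in its name.
printf 'FIND only\n' > "$tmp/my file"

# The command substitution runs in a subshell, so cd and LC_ALL
# do not leak into the calling shell.
matches=$(
  cd "$tmp" && export LC_ALL=C &&
  find .//. -type f |
    convert_to_xargs |
    xargs grep -l FIND |
    convert_to_xargs |
    xargs grep -l ME
)
printf '%s\n' "$matches"
# Only the file that contains both words is reported, as one path
# spanning two lines:
# .//./a
# ./b
```

The first grep reports both files; the second pass through convert_to_xargs re-escapes its output, so the embedded newline and the space survive until the second grep.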

You need LC_ALL=C so that sed, awk (and some implementations of xargs) work on arbitrary sequences of bytes, even those that don't form valid characters in the user's locale. It also simplifies the definition of blank to just SPC and TAB, and avoids problems caused by the different utilities disagreeing on how to interpret characters whose encoding contains the encoding of backslash.
