Grep Find – Search for Text Files Where Two Different Words Exist

find, grep, search

I'm looking for a way to find files in which two words both occur. I've been using the following to perform my searches up to this point:

find . -exec grep -l "FIND ME" {} \;

The problem I'm running into is that if there isn't exactly one space between "FIND" and "ME", the search does not return the file. How do I adapt this command so that it matches files in which both "FIND" and "ME" occur, as opposed to the literal string "FIND ME"?

I'm using AIX.

Best Answer

With GNU tools:

find . -type f -exec grep -lZ FIND {} + | xargs -r0 grep -l ME

Using only standard (POSIX) features, you can do:

find . -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;

But that would run up to two greps per file. To avoid running that many greps and still be portable while still allowing any character in file names, you could do:

convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

export LC_ALL=C
find .//. -type f |
  convert_to_xargs |
  xargs grep -l FIND |
  convert_to_xargs |
  xargs grep -l ME

The idea is to convert the output of find into the format xargs expects: a list of words separated by blanks (SPC, TAB and NL in the C locale; YMMV in other locales), in which single quotes, double quotes and backslashes can escape blanks and each other.
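To try the pipeline end to end, here is a small sketch on made-up sample files whose names contain spaces. The file names and the use of mktemp -d are assumptions for illustration only; the helper is the same one as in the answer.

```shell
# Same helper as in the answer: backslash-escape blanks, quotes and
# backslashes, and append a backslash to a line when the next line
# does not contain "//" (i.e. when the file name continues).
convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

# Hypothetical sample files; mktemp -d is assumed to be available.
tmp=$(mktemp -d)
printf 'FIND here\nand ME there\n' > "$tmp/both words.txt"
printf 'FIND only\n' > "$tmp/only find.txt"
printf 'ME only\n' > "$tmp/only me.txt"

# Run the two-stage search inside a subshell so that cd and LC_ALL
# do not leak into the calling shell.
matches=$(
  cd "$tmp" || exit
  export LC_ALL=C
  find .//. -type f |
    convert_to_xargs |
    xargs grep -l FIND |
    convert_to_xargs |
    xargs grep -l ME
)
rm -rf "$tmp"
printf '%s\n' "$matches"
```

Only the file that contains both words should be listed, even though FIND and ME are on different lines of it.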

Generally you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that are found in file names. For instance if we see:

./a
./b

We have no way of knowing whether it's one file called b in a directory called a<NL>. or two files, a and b, in the current directory.

By using .//., because // cannot appear otherwise in a file path as output by find (because there's no such thing as a directory with an empty name and / is not allowed in a file name), we know that if we see a line that contains //, then that's the first line of a new filename. So we can use that awk command to escape all newline characters but those that precede those lines.

If we take the example above, find would output in the first case (one file):

.//a
./b

Which awk escapes to:

.//a\
./b

So that xargs sees it as one argument. And in the second case (two files):

.//a
.//b

Which awk would leave as is, so xargs sees two arguments.
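The two cases can be reproduced in a scratch directory. This is only a sketch: the directory layout is invented for the demonstration and mktemp -d is assumed to be available.

```shell
# Hypothetical scratch directories; mktemp -d is assumed to be available.
tmp=$(mktemp -d)
nl='
'
# Case 1: one file called b in a directory called a<NL>.
mkdir -p "$tmp/one/a${nl}."
: > "$tmp/one/a${nl}./b"
# Case 2: two files, a and b.
mkdir "$tmp/two"
: > "$tmp/two/a"
: > "$tmp/two/b"

# Plain find prints two lines in both cases, so they look alike.
plain_one=$(cd "$tmp/one" && find . -type f | wc -l)
plain_two=$(cd "$tmp/two" && find . -type f | wc -l)

# With .//. each file name starts on a line that contains "//",
# so counting those lines counts files.
marked_one=$(cd "$tmp/one" && find .//. -type f | grep -c '//')
marked_two=$(cd "$tmp/two" && find .//. -type f | grep -c '//')

rm -rf "$tmp"
echo "plain: $plain_one $plain_two, marked: $marked_one $marked_two"
```

The plain line counts are the same for both layouts, while the count of lines containing // differs (one file versus two), which is the information the awk script relies on.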

You need LC_ALL=C so that sed, awk (and some implementations of xargs) work on arbitrary sequences of bytes (even ones that don't form valid characters in the user's locale), to simplify the definition of blank to just SPC and TAB, and to avoid problems caused by the utilities interpreting differently those characters whose encoding contains the encoding of backslash.

Related Solutions

How to use the results of “file” (Name of Creating Application: Microsoft Word) to search for a specific string

If I understand correctly, you want to move files from the current directory and its subdirectories recursively to another directory, but only if the file command reports them as “Microsoft Word” files. That is, you're interested in the files for which file "$filename" | grep 'Microsoft Word' produces some output.

An easy way is to take things calmly and do it file by file. If you only want the files in the current directory, you can use a for loop and a wildcard pattern:

for f in *.doc; do
  if …
done

What's the condition? We want to test if Microsoft Word appears in the output of file "$f". I use file -- to protect against files whose name begins with -.

for f in *.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    …
  fi
done

All we need to do is add the command to move the files.

for f in *.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    mv -- "$f" ../NewDirectory/
  fi
done

If you want to look for files in subdirectories as well, use the ** wildcard pattern for recursive globbing. In bash, it needs to be activated with shopt -s globstar (in ksh93, you need set -o globstar, and in zsh it works out of the box; other shells lack this feature). Beware that bash ≤4.2 follows symbolic links to directories.

for f in **/*.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    mv -- "$f" ../NewDirectory/
  fi
done

Note that all moved files end up directly in ../NewDirectory/; no subdirectories are created. If you want to reproduce the directory tree, you can use string manipulation constructs to extract the directory part of the file name and mkdir -p to create the target directory if necessary.

for f in ./**/*.doc; do
  if file "$f" | grep -q 'Microsoft Word'; then
    d="${f%/*}"
    mkdir -p ../NewDirectory/"$d"
    mv "$f" ../NewDirectory/"$d"
  fi
done

Rather than parse the human-readable output of file, which is somewhat fragile, you might prefer to parse the output of file -i, which (with the file command found on most Linux systems) prints the MIME type in a standardized form.
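As a rough sketch of that suggestion, the test could compare the MIME type instead of the description. The helper name is_word_mime is made up for illustration, and application/msword is the type commonly reported for legacy .doc files; your version of file may print something different, so check its output on a known file first.

```shell
# Hypothetical helper: succeed if a MIME string (such as the one printed
# by file -b -i) names a legacy Word document.
is_word_mime() {
  case $1 in
    application/msword*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch (assumes a file command that supports -b and -i):
# for f in *.doc; do
#   if is_word_mime "$(file -b -i -- "$f")"; then
#     mv -- "$f" ../NewDirectory/
#   fi
# done
```

Keeping the comparison in a small function means only one pattern needs adjusting if your files turn out to be reported under a different type.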

Linux – Using Grep to Find Multiple Repeating Characters in a Word

This calls for backreferences!

If you are ever referring to something you have already matched, and you want to match it again, use backreferences.

grep -E '(..)(.*\1){<n - 1>}' <file>

.* matches any sequence of characters
(..) matches any two characters
\1 matches the first group, in this case the (..) near the beginning

Replace <n - 1> with the number of times the pair must occur minus one, and <file> with the name of the file you want to search (or omit it to read from stdin).

This may not be the most efficient solution, but it works.
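For a concrete value of n, the pattern can be wrapped in a small function. This is a sketch only: repeated_pair is a made-up name, the sample words are invented, and it assumes a grep that supports backreferences with -E (GNU grep does; POSIX does not require it).

```shell
# Hypothetical helper: print input lines in which some pair of
# characters occurs at least $1 times (non-overlapping).
repeated_pair() {
  grep -E "(..)(.*\\1){$(( $1 - 1 ))}"
}

# Sample input: "banana" repeats a pair twice, "abcabcab" three times.
twice=$(printf 'banana\nabcabcab\nplain\n' | repeated_pair 2)
thrice=$(printf 'banana\nabcabcab\nplain\n' | repeated_pair 3)
printf 'twice:\n%s\nthrice:\n%s\n' "$twice" "$thrice"
```

With n set to 2 both banana and abcabcab are printed; with n set to 3 only abcabcab is.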
