Shell – How to Find Files Containing Multiple Keywords

Tags: awk, find, grep, shell, text-processing

I'm looking for a way to list all files in a directory that contain the full set of keywords I'm seeking, anywhere in the file.

So the keywords need not appear on the same line.

One way to do this would be:

grep -l one $(grep -l two $(grep -l three *))

Three keywords is just an example, it could just as well be two, or four, and so on.

A second way I can think of is:

grep -l one * | xargs grep -l two | xargs grep -l three

A third method, that appeared in another question, would be:

find . -type f \
  -exec grep -q one {} \; -a \
  -exec grep -q two {} \; -a \
  -exec grep -q three {} \; -a -print

But that's definitely not the direction I'm going here. I want something that requires less typing, and possibly just one call to grep, awk, perl or similar.

For example, I like how awk lets you match lines that contain all keywords, like:

awk '/one/ && /two/ && /three/' *

Or, print just the file names:

awk '/one/ && /two/ && /three/ { print FILENAME ; nextfile }' *

But I want to find files where the keywords may be anywhere in the file, not necessarily on the same line.


Preferred solutions would be gzip-friendly; for example, grep has the zgrep variant that works on compressed files. I mention this because some solutions may not work well under this constraint. For example, with the awk approach of printing matching file names, you can't just do:

zcat * | awk '/pattern/ {print FILENAME; nextfile}'

You need to significantly change the command, to something like:

for f in *; do zcat "$f" | awk -v F="$f" '/pattern/ { print F; nextfile }'; done

So, because of the constraint, you need to call awk many times, even though you could do it only once with uncompressed files. And certainly, it would be nicer to just do zawk '/pattern/ {print FILENAME; nextfile}' * and get the same effect, so I would prefer solutions that allow this.

Best Answer

awk 'FNR == 1 { f1=f2=f3=0; };

     /one/   { f1++ };
     /two/   { f2++ };
     /three/ { f3++ };

     f1 && f2 && f3 {
       print FILENAME;
       nextfile;
     }' *

If you want to automatically handle gzipped files, either run this in a loop with zcat (slow and inefficient because you'll be forking awk many times, once for each file), or rewrite the same algorithm in perl using the IO::Uncompress::AnyUncompress library module, which can decompress several different kinds of compressed files (gzip, zip, bzip2, lzop), or write it in python, which also has modules for handling compressed files.
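
For illustration, here's a minimal sketch of that zcat loop, reusing the flag logic of the awk program above (zcat -f passes uncompressed files through unchanged; the variable name fname is arbitrary, and exit replaces nextfile because each file gets its own awk process, so the per-file flags never need resetting):

# print the name of every file, compressed or not, that contains
# all three words somewhere in it
for f in *; do
  zcat -f "$f" |
    awk -v fname="$f" '
      /one/   { f1=1 }
      /two/   { f2=1 }
      /three/ { f3=1 }
      f1 && f2 && f3 { print fname; exit }'
done

This produces the same results as the single awk invocation, but forks zcat and awk once per file, which is exactly the inefficiency mentioned above.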


Here's a perl version that uses IO::Uncompress::AnyUncompress to allow for any number of patterns and any number of filenames (containing either plain text or compressed text).

All args before -- are treated as search patterns; all args after -- are treated as filenames. That's primitive but effective option handling for this job. Better option handling (e.g. to support a -i option for case-insensitive searches) could be achieved with the Getopt::Std or Getopt::Long modules.

Run it like so:

$ ./arekolek.pl one two three -- *.gz *.txt
1.txt.gz
4.txt.gz
5.txt.gz
1.txt
4.txt
5.txt

(I won't list files {1..6}.txt.gz and {1..6}.txt here...they just contain some or all of the words "one" "two" "three" "four" "five" and "six" for testing. The files listed in the output above DO contain all three of the search patterns. Test it yourself with your own data)

#! /usr/bin/perl

use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  #my $lc=0;
  my %s = ();
  my $z = new IO::Uncompress::AnyUncompress($f)
    or die "IO::Uncompress::AnyUncompress failed: $AnyUncompressError\n";

  while ($_ = $z->getline) {
    #last if ($lc++ > 100);
    my @matches=( m/($pattern)/og);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      last;
    }
  }
}

The hash %patterns contains the complete set of patterns, each of which a file must contain at least once. $p_string is a string containing the sorted keys of that hash, and $pattern is a pre-compiled regular expression also built from the %patterns hash.

$pattern is compared against each line of each input file (using the /o modifier to compile $pattern only once as we know it won't ever change during the run), and map() is used to build a hash (%s) containing the matches for each file.

Whenever all the patterns have been seen in the current file (detected by comparing $m_string, the sorted keys of %s, against $p_string), the script prints the filename and skips to the next file.

This is not a particularly fast solution, but is not unreasonably slow. The first version took 4m58s to search for three words in 74MB worth of compressed log files (totalling 937MB uncompressed). This current version takes 1m13s. There are probably further optimisations that could be made.

One obvious optimisation is to use this in conjunction with xargs's -P (a.k.a. --max-procs) option to run multiple searches on subsets of the files in parallel. To do that, count the number of files and divide by the number of cores/cpus/threads your system has (rounding up by adding 1). For example, there were 269 files being searched in my sample set, and my system has 6 cores (an AMD 1090T), so:

patterns=(one two three)
searchpath='/var/log/apache2/'
cores=6
filecount=$(find "$searchpath" -type f -name 'access.*' | wc -l)
filespercore=$((filecount / cores + 1))

find "$searchpath" -type f -print0 | 
  xargs -0r -n "$filespercore" -P "$cores" ./arekolek.pl "${patterns[@]}" --

With that optimisation, it took only 23 seconds to find all 18 matching files. Of course, the same could be done with any of the other solutions. NOTE: the order of filenames in the output will be different, so it may need to be sorted afterwards if that matters.

As noted by @arekolek, multiple zgreps with find -exec or xargs can do it significantly faster, but this script has the advantage of supporting any number of patterns to search for, and is capable of dealing with several different types of compression.
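
For comparison, a rough sketch of what such a zgrep chain might look like (the patterns and globs are just the examples from the question, and GNU xargs is assumed for the -r and -d options):

# keep only files containing "one", then filter those for "two", then "three"
zgrep -l one *.txt *.gz |
  xargs -r -d '\n' zgrep -l two |
  xargs -r -d '\n' zgrep -l three

Each extra pattern adds another pass over the surviving files, whereas the perl script handles any number of patterns in a single pass per file.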

If the script is limited to examining only the first 100 lines of each file, it runs through all of them (in my 74MB sample of 269 files) in 0.6 seconds. If this is useful in some cases, it could be made into a command line option (e.g. -l 100) but it has the risk of not finding all matching files.


BTW, according to the man page, IO::Uncompress::AnyUncompress supports several compression formats, including gzip, zip, bzip2, and lzop; see the module's documentation for the complete list.


One last (I hope) optimisation. By using the PerlIO::gzip module (packaged in Debian as libperlio-gzip-perl) instead of IO::Uncompress::AnyUncompress, I got the time down to about 3.1 seconds for processing my 74MB of log files. There were also some small improvements from using a simple hash rather than Set::Scalar (which also saved a few seconds with the IO::Uncompress::AnyUncompress version).

PerlIO::gzip was recommended as the fastest perl gunzip in https://stackoverflow.com/a/1539271/137158 (found with a Google search for "perl fast gzip decompress").

Using xargs -P with this didn't improve it at all. In fact it even seemed to slow it down by anywhere from 0.1 to 0.7 seconds. (I tried four runs and my system does other stuff in the background which will alter the timing)

The price is that this version of the script can only handle gzipped and uncompressed files. Speed vs flexibility: 3.1 seconds for this version vs 23 seconds for the IO::Uncompress::AnyUncompress version with an xargs -P wrapper (or 1m13s without xargs -P).

#! /usr/bin/perl

use strict;
use warnings;
use PerlIO::gzip;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  open(F, "<:gzip(autopop)", $f) or die "couldn't open $f: $!\n";
  #my $lc=0;
  my %s = ();
  while (<F>) {
    #last if ($lc++ > 100);
    my @matches=(m/($pattern)/ogi);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      close(F);
      last;
    }
  }
}