Text Processing – Find Files with Matched Whole Lines from a File

awk, grep, perl, text-processing

I have a file with this content:

$ cat compromised_header.txt
some unique string 1
some other unique string 2
another unique string 3

I want to find all files that contain all the lines of the above file, in exactly the same order, with no intervening lines.

Example input file:

$ cat a-compromised-file.txt
some unique string 1
some other unique string 2
another unique string 3
unrelated line x
unrelated line y
unrelated line z

I tried using the grep below:

grep -rlf compromised_header.txt dir/

But I'm not sure it gives only the expected files, since it would also match a file containing:

some unique string 1
unrelated line x
unrelated line y
unrelated line z
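It would indeed: grep -f treats each line of the pattern file as an independent pattern, so a file containing any single one of them matches. A minimal demonstration (the sandbox directory and file names below are invented for illustration):

```shell
# grep -f matches files containing ANY one pattern line, not the whole block
dir=$(mktemp -d)    # hypothetical sandbox
printf '%s\n' 'some unique string 1' 'some other unique string 2' > "$dir/hdr"
mkdir "$dir/tree"
printf '%s\n' 'some unique string 1' 'unrelated line x' > "$dir/tree/partial.txt"
grep -rlf "$dir/hdr" "$dir/tree"    # lists partial.txt despite the missing line
```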

Best Answer

Using an awk implementation that supports nextfile (GNU awk, for example):

# First pass (NR == FNR): store the header lines in a[1..n]
NR == FNR {
  a[++n]=$0; next
}
# Mismatch: reset the run counter, unless this very line restarts a match at a[1]
$0 != a[c+1] && (--c || $0!=a[c+1]) {
  c=0; next
}
# This line extended the run; after n consecutive matches, report the file
++c >= n {
  print FILENAME; c=0; nextfile
}

with find for recursion:

find dir -type f -exec gawk -f above.awk compromised_header.txt {} +

Or this might work:

pcregrep -rxlM "$( perl -lpe '$_=quotemeta' compromised_header.txt )" dir

Using perl to escape metacharacters because pcregrep doesn't seem to combine --fixed-strings with --multiline.

With perl in slurp mode (won't work with files that are too large to hold in memory):

find dir -type f -exec perl -n0777E 'BEGIN {$f=<>} say $ARGV if /^\Q$f/m
' compromised_header.txt {} +