Multiline Regexp (grep, sed, awk, perl)

awk, grep, regular expression, sed

I know that multiline regexp has been discussed dozens of times but I just can't get it to work with my pattern.

I'll try to explain.
I have some text files in a directory.
Example of text in a file:

LINE OF TEXT 2
LINE OF TEXT 1
LINE OF TEXT 3

LINE OF TEXT 1
LINE OF TEXT 2
LINE OF TEXT 3

LINE OF TEXT 1
LINE OF TEXT 3

LINE OF TEXT 3
LINE OF TEXT 2
LINE OF TEXT 1

LINE OF TEXT 2
LINE OF TEXT 3

I want to find "LINE OF TEXT 3" which comes after "LINE OF TEXT 2" which in turn comes after "LINE OF TEXT 1" (with no empty lines in between).

Each line must itself be matched by a regexp, for example a line that starts with "LINE" and ends with a particular number.

Note: Not all files contain that exact line sequence, so if the pattern matches, don't print the matched text; just print the filename to STDOUT.

Can this be done with a one-liner regexp? For example, awk searches a file for the pattern and prints the filename to STDOUT if the pattern is found. I can then use this in combination with "find -exec".

Any of the mentioned tools will do (grep, awk, sed or perl).

Best Answer

You can do this with Awk by setting the "Record Separator" variable to be a regex matching at least two consecutive newline characters:

awk -v RS='\n\n+' '/1.*2.*3/' file.txt

You can also set the "Field Separator" to be a single newline character:

awk -v RS='\n\n+' -F '\n' '$1 == "LINE OF TEXT 1" && $2 == "LINE OF TEXT 2" && $3 == "LINE OF TEXT 3"' file.txt

Broken up for readability:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3"
' file.txt
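
For a more portable variant, POSIX awk's paragraph mode (an empty RS) splits records on blank lines without relying on a regex RS. A minimal sketch, assuming a throwaway sample file created with mktemp and hypothetical per-line patterns of the form ^LINE.*N$ in the spirit of the question:

```shell
# Build a small sample file mirroring the layout in the question.
sample=$(mktemp)
printf '%s\n' \
  'LINE OF TEXT 2' 'LINE OF TEXT 1' 'LINE OF TEXT 3' '' \
  'LINE OF TEXT 1' 'LINE OF TEXT 2' 'LINE OF TEXT 3' '' \
  'LINE OF TEXT 1' 'LINE OF TEXT 3' > "$sample"

# An empty RS turns on paragraph mode: blank lines separate records.
# -F '\n' makes every line of a paragraph its own field.
matches=$(awk -v RS= -F '\n' '
  $1 ~ /^LINE.*1$/ && $2 ~ /^LINE.*2$/ && $3 ~ /^LINE.*3$/ { print NR }
' "$sample")

echo "$matches"   # number of each paragraph whose first three lines match
rm -f "$sample"
```

Only the second paragraph has the three lines in the required order, so only its record number is printed.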

With your requirement of only printing the filename if a match is found, you can do this like so:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3" {
    found++   # "match" is a built-in awk function, so use another name
  }
  END {
    if (found) {
      print FILENAME
    }
  }
' file.txt

But considering you are talking about using find in combination with awk, I'd recommend just using Awk for the exit status and using find for the printing:

find . -type f -exec awk -v RS='\n\n+' -F '\n' '
  $1 ~ /LINE OF TEXT 1/ &&
  $2 ~ /LINE OF TEXT 2/ &&
  $3 ~ /LINE OF TEXT 3/ {
    found = 1
    exit   # jumps to END, which sets the final exit status
  }
  END { exit !found }
' {} \; -print

That way, if you want to do something else before printing (some other find primary), you're already set up to do so.
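
To see the exit-status approach end to end, here is a rough sketch that runs it against a scratch directory. The file names hit.txt and miss.txt are made up for illustration. It records the match in a flag and lets END turn that into the exit status, since an exit inside a main rule still runs the END block:

```shell
# Scratch directory with one matching and one non-matching file.
dir=$(mktemp -d)
printf 'LINE OF TEXT 1\nLINE OF TEXT 2\nLINE OF TEXT 3\n' > "$dir/hit.txt"
printf 'LINE OF TEXT 1\nLINE OF TEXT 3\n\nLINE OF TEXT 2\n' > "$dir/miss.txt"

# awk exits 0 only when a paragraph matched; find prints the
# file name only when -exec reported success.
found_files=$(find "$dir" -type f -exec awk -v RS= -F '\n' '
  $1 ~ /LINE OF TEXT 1/ && $2 ~ /LINE OF TEXT 2/ && $3 ~ /LINE OF TEXT 3/ {
    found = 1
    exit
  }
  END { exit !found }
' {} \; -print)

echo "$found_files"
rm -rf "$dir"
```

Only the path of hit.txt should be listed.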

Related Solutions

Replacing string in all files found by grep. Can’t get it to work

Typically, when you get a > on the next line after hitting Enter, it means that one of your quotes isn't closed yet. I couldn't find that mistake in your regex. But you do not need to surround the path /var/www_data/somepath/ with single quotes. I assume there are no unusual characters in somepath?

Anyway, I tested your regex with sed. \d and \w look like Vim syntax to me, so I translated them to explicit ASCII character classes (which always work). Also, inside [] you do not need to escape the dot, and a literal - should come last so that it is not read as a range:

sed -r "s/'([A-Za-z0-9_.-]+)(@domain\.com)'/'adsf'/g" test.dat
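
As a quick check of the bracket expression, the sketch below feeds one made-up input line through the substitution. The address john.doe-x@domain.com is purely illustrative, and -E is used as the spelling of extended regexes that both GNU and BSD sed accept:

```shell
# A hyphen placed last inside [...] is literal; the dot needs no escape there.
out=$(printf '%s\n' "user = 'john.doe-x@domain.com'" |
  sed -E "s/'([A-Za-z0-9_.-]+)(@domain\.com)'/'adsf'/g")

echo "$out"
```

The quoted address should be replaced as a whole, leaving user = 'adsf'.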

Indeed, you can use sed or perl for your task. You don't necessarily need grep to generate a file list unless you have gigabytes of data, in which case narrowing down the files first could speed things up.

To test your regex, you could do the following:

cd /var/www_data/somepath/
sed -r 's|pattern|replace-pattern|g' a_single_file.php

When you're satisfied with the result, add the -i.bak (--in-place=.bak) argument and run it on all files. The -name tests must be grouped in parentheses, otherwise -exec applies only to the last one:

find . -type f \( -name '*.php' -o -name '*.ini' -o -name '*.conf' -o -name '*.sh' \) \
-exec sed -r -i.bak 's|pattern|replace-pattern|g' '{}' \;

The original files are saved as <originalname.php>.bak.
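
A small dry run in a scratch directory can confirm which files get touched before running this on real data. This is only a sketch: the file names a.php and b.txt and the old/new strings are placeholders, and -E stands in for -r:

```shell
dir=$(mktemp -d)
printf 'old\n' > "$dir/a.php"
printf 'old\n' > "$dir/b.txt"   # does not match any -name test

# Grouping with \( ... \) makes -exec apply to every listed extension.
find "$dir" -type f \( -name '*.php' -o -name '*.ini' \) \
  -exec sed -E -i.bak 's|old|new|g' {} \;

edited=$(cat "$dir/a.php")       # rewritten in place
backup=$(cat "$dir/a.php.bak")   # untouched copy of the original
listing=$(cd "$dir" && ls)       # b.txt gets no backup file

printf '%s\n' "$edited" "$backup" "$listing"
rm -rf "$dir"
```

The listing should show a.php, a.php.bak and b.txt, with no backup created for the file that was never selected.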

To answer your last question: grep is the tool you want for this job. You could run it on the .bak files generated by sed above:

grep --recursive --include='*.bak' -E --files-with-matches 'pattern' . > files_fixed.txt

or, simply:

find . -type f -name '*.bak'

How to delete all text between curly brackets in a multiline text file

$ sed ':again;$!N;$!b again; s/{[^}]*}//g' file
This is 
that wants
 anyway.

Explanation:

:again;$!N;$!b again;

This reads the whole file into the pattern space.

:again is a label. N reads in the next line. $!b again branches back to the again label on the condition that this is not the last line.

s/{[^}]*}//g

This removes all expressions in braces.
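
The two steps can be tried on a small made-up input. The text below is hypothetical, not the asker's file, and the script is split into separate -e chunks, which tends to be the more portable spelling:

```shell
# Slurp all lines into the pattern space, then delete every {...} span,
# including one that crosses a line break.
out=$(printf 'one{two\nthree}four\nfive{six}seven\n' |
  sed -e ':again' -e '$!N' -e '$!b again' -e 's/{[^}]*}//g')

echo "$out"
```

Because [^}] also matches a newline once the whole file sits in the pattern space, the first brace pair is removed even though it spans two lines.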

On Mac OSX, try:

sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' file

Nested Braces

Let's take this as a test file with lots of nested braces:

a{b{c}d}e
1{2
}3{
}
5

Here is a modification to handle nested braces:

$ sed ':again;$!N;$!b again; :b; s/{[^{}]*}//g; t b' file2
ae
13
5
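
The same nested-brace loop can be written with separate -e chunks and fed the test data on standard input. A sketch under the assumption that no file2 exists on disk, so the sample is generated with printf:

```shell
# Repeatedly strip innermost brace pairs: "t b" loops back to :b
# for as long as the previous substitution changed something.
out=$(printf 'a{b{c}d}e\n1{2\n}3{\n}\n5\n' |
  sed -e ':again' -e '$!N' -e '$!b again' \
      -e ':b' -e 's/{[^{}]*}//g' -e 't b')

echo "$out"
```

Excluding { as well as } from the bracket expression is what restricts each pass to innermost pairs.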