Grepping string, but include all non-blank lines following each grep match

Tags: grep, text-processing

Consider the following toy example:

this is a line 
this line contains FOO 
this line is not blank

This line also contains FOO

Some random text

This line contains FOO too
Not blank 
Also not blank

More random text 
FOO!
Yet more random text
FOO!

I want the results of a grep for FOO, with the extra wrinkle that the lines following each matching line should also be included, as long as they are not blank and do not themselves contain FOO. The matches, shown separately, would look as follows:

MATCH 1

this line contains FOO 
this line is not blank

MATCH 2

This line also contains FOO

MATCH 3

This line contains FOO too 
Not blank 
Also not blank

MATCH 4

FOO!
Yet more random text

MATCH 5

FOO!

Bonus points (metaphorically speaking) for a simple single line script that can be run on the command line.

ADDENDUM: Adding a running count of the match number would be quite handy, if it is not too hard.

Best Answer

Using awk rather than grep:

awk '/FOO/ { if (matching) printf("\n"); matching = 1 }
     /^$/  { if (matching) printf("\n"); matching = 0 }
     matching' file

A version that enumerates the matches:

awk 'function flush_print_maybe() {
         if (matching) printf("Match %d\n%s\n\n", ++n, buf)
         buf = ""
     }
     /FOO/ { flush_print_maybe(); matching = 1 }
     /^$/  { flush_print_maybe(); matching = 0 }
     matching { buf = (buf == "" ? $0 : buf ORS $0) }
     END   { flush_print_maybe() }' file

Both awk programs use a very simple "state machine" to keep track of whether they are currently matching or not matching. A match of the pattern FOO causes the program to enter the matching state, and a match of the pattern ^$ (an empty line) causes it to enter the non-matching state.
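To make the two states visible, here is a minimal sketch that prints the state next to every input line instead of printing the matches. The function name trace_states and the sample input are made up for illustration; only the two pattern rules are taken from the programs above.

```shell
# trace_states is a hypothetical helper: it prefixes each input line with the
# state that is in effect once the line has been read
# (1 = matching, 0 = not matching).
trace_states() {
    awk '/FOO/ { matching = 1 }
         /^$/  { matching = 0 }
         { printf("%d|%s\n", matching, $0) }' "$@"
}

printf '%s\n' 'intro' 'has FOO' 'tail' '' 'other' 'FOO!' | trace_states
# 0|intro
# 1|has FOO
# 1|tail
# 0|
# 0|other
# 1|FOO!
```

The state only changes on a FOO line or an empty line; every other line inherits whatever state the previous line left behind.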

An empty line is output between sets of matching data whenever the program leaves the matching state, whether it re-enters the matching state (a new FOO line) or enters the non-matching state (an empty line).

The first program prints any line when in the matching state.

The second program collects lines in a buf variable while in the matching state. At each state transition (where the first program would output an empty line), it prints the buffer together with a Match N label if it was in the matching state, and then flushes (empties) the buffer.

Output of this last program on the sample data:

Match 1
this line contains FOO
this line is not blank

Match 2
This line also contains FOO

Match 3
This line contains FOO too
Not blank
Also not blank

Match 4
FOO!
Yet more random text

Match 5
FOO!
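
If the pattern needs to vary, the enumerating program could be wrapped in a shell function along the following lines. This is only a sketch: the name match_blocks and the awk variable pat are assumptions of mine, not part of the answer above. Note that awk processes backslash escapes in values passed with -v, so a pattern containing backslashes may need them doubled.

```shell
# match_blocks PATTERN [FILE...]
# Hypothetical wrapper around the enumerating awk program; PATTERN is an
# awk extended regular expression. Reads standard input if no FILE is given.
match_blocks() {
    pat=$1
    shift
    awk -v pat="$pat" '
        function flush_print_maybe() {
            if (matching) printf("Match %d\n%s\n\n", ++n, buf)
            buf = ""
        }
        $0 ~ pat { flush_print_maybe(); matching = 1 }   # start a new block
        /^$/     { flush_print_maybe(); matching = 0 }   # empty line ends it
        matching { buf = (buf == "" ? $0 : buf ORS $0) } # collect the line
        END      { flush_print_maybe() }
    ' "$@"
}

printf '%s\n' 'a' 'x FOO' 'b' '' 'c' 'FOO!' | match_blocks FOO
```

On this small input it should print "Match 1" followed by the lines "x FOO" and "b", then "Match 2" followed by "FOO!", with an empty line after each block.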

Related Solutions

How to grep a directory based on the contents of two successive lines

@warl0ck pointed me in the right direction with pcregrep, but I said "contains", not "is", and I asked about a directory, not a file.

This seems to work for me.

pcregrep -rMi 'Foo(.*)\n(.*)Bar' .

Grep Unicode – Find All Lines Containing Japanese Kanjis

It is impossible (without using a huge table) to tell a Japanese kanji apart from a Han ideograph that is not used in Japanese (e.g. a Chinese or Korean variant).

If you just want to detect any Han ideograph in the basic range (\u4e00 to \u9fff), then in UTF-8 each of them is encoded in 3 bytes: the first byte is always between 0xe4 and 0xe9, and the second and third bytes are between 0x80 and 0xbf.
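
The byte ranges can be checked with a little shell arithmetic. The sketch below applies the standard UTF-8 rule for code points from U+0800 to U+FFFF (three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx); the function name utf8_3byte is made up for illustration.

```shell
# utf8_3byte CODEPOINT
# Hypothetical helper: prints the three UTF-8 bytes (in hex) of a code point
# in the range 0x0800..0xFFFF. No range checking is done.
utf8_3byte() {
    cp=$(( $1 ))
    printf '%02x %02x %02x\n' \
        $(( 0xE0 | (cp >> 12) )) \
        $(( 0x80 | ((cp >> 6) & 0x3F) )) \
        $(( 0x80 | (cp & 0x3F) ))
}

utf8_3byte 0x4e00   # e4 b8 80
utf8_3byte 0x9fff   # e9 bf bf
```

One consequence worth keeping in mind: the lowest sequence allowed by those byte ranges, e4 80 80, decodes to U+4000, so a pattern built from the ranges alone also covers U+4000 to U+4DFF, slightly more than the basic block.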

There are two difficulties here. First, you have to tell grep that you want to match bytes and not characters; second, you have to type the 0xe4, 0xe9, 0x80 and 0xbf bytes in order to put them in the regular expression.

I discovered that the -P switch (Perl-compatible regular expressions) does both, so the line you want is:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"

and if you want kana too:

grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]"
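
One caveat, as far as I can tell: in a UTF-8 locale, recent versions of GNU grep run -P in UTF-8 mode, where \xe4 denotes the character U+00E4 rather than the byte 0xe4. The sketch below therefore forces the C locale so that the escapes are compared byte by byte. The function names han_lines and han_kana_lines are made up for illustration, and the whole thing assumes a grep built with -P support.

```shell
# Hypothetical wrappers around the two commands above. LC_ALL=C is an
# assumption of this sketch: it makes grep -P treat \xNN as raw bytes.
han_lines() {
    LC_ALL=C grep -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]' "$@"
}

han_kana_lines() {
    LC_ALL=C grep -P '[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]' "$@"
}

# Example use (input must be UTF-8 encoded):
#   printf '%s\n' 'plain ascii' '漢字' 'かな' | han_lines
#   printf '%s\n' 'plain ascii' '漢字' 'かな' | han_kana_lines
```

With that input, han_lines should print only the 漢字 line, and han_kana_lines should print both the 漢字 and the かな lines. If your grep supports Unicode properties under -P, something like \p{Han} in a UTF-8 locale may be a simpler alternative, but I have not compared the two in detail.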
