Filter separated paragraphs according to their first word

sed text-processing

I have a program which prints out lines of text ("paragraphs") separated by '--'. For example it might print

--
are you happy
--
I am hungry
are you
--
are(you hungry
too

I want to pipe that into another program (sed, maybe?) and get back only the paragraphs that start with a given word (e.g. "are"). For the input above, selecting the paragraphs that begin with "are" would give

--
are you happy
--
are(you hungry
too

The program prints out a potentially very large number of "paragraphs" but I expect only a small number to match, which is why I would prefer to be able to filter the program's output in a streaming way (avoiding writing everything to a huge file and then filtering it).

Best Answer

AWK

Using GNU awk or mawk:

$ awk '$1~"^"word{printf("--\n%s",$0)}' word='are' RS='--\n' infile
--
are you happy
--
are(you hungry
too

This sets the variable word to the word to match at the beginning of each record, and RS (the record separator) to '--' followed by a newline (\n). Then, for any record whose first field starts with that word ($1~"^"word), it prints '--', a newline, and the record exactly as it was read. Note that word is used as a regex, and that a multi-character RS is an extension to POSIX awk, which is why GNU awk or mawk is needed.
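Where only a POSIX awk is available, the same idea can be sketched line by line, without a multi-character RS. This is a variant under stated assumptions, not the command above: the function name filter_paragraphs is hypothetical, and the word is compared as a literal prefix with index() rather than as a regex.

```shell
#!/bin/sh
# Hypothetical helper: print only the paragraphs whose first line starts with $1.
# Input is read line by line, so only one line is held in memory at a time.
filter_paragraphs() {
    awk -v word="$1" '
        $0 == "--" { first = 1; keep = 0; next }    # separator: decide on the next line
        first      { first = 0                      # first line of a paragraph
                     keep = (index($0, word) == 1)  # literal prefix test, not a regex
                     if (keep) print "--" }
        keep       { print }                        # emit lines of a kept paragraph
    '
}

# Sample input from the question, piped in as a stream.
printf '%s\n' '--' 'are you happy' '--' 'I am hungry' 'are you' '--' 'are(you hungry' 'too' |
    filter_paragraphs are
```

As with the original command, a prefix such as "are" also selects paragraphs that start with a longer word like "area".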

GREP

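Using GNU grep (for the -P, -z and -o options):

grep -Pzo -- '--\nare(?:[^\n]*\n)+?(?=--|\Z)' infile
grep -Pzo -- '(?s)--\nare.*?(?=\n--|\Z)\n' infile
grep -Pzo -- '(?s)--\nare(?:(?!\n--).)*\n' infile

With -z the whole input is treated as a single record, so -o is needed to print only the matching paragraphs rather than everything. Each match is followed by a NUL byte, which can be removed by piping the output through tr -d '\0'.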
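As a quick check of the second pattern, the sketch below feeds the sample input from the question to grep through a pipe. It assumes GNU grep built with PCRE support. It uses -o so that only the matched paragraphs are printed, and tr to drop the NUL byte that grep -z writes after each match. The function name grep_paragraphs is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the second pattern; "are" is hard-coded as in the answer.
# Note: with -z and no NUL bytes in the input, grep holds the whole input as one
# record, so this is less stream-friendly than a line-by-line filter.
grep_paragraphs() {
    grep -Pzo -- '(?s)--\nare.*?(?=\n--|\Z)\n' | tr -d '\0'
}

# Sample input from the question.
printf '%s\n' '--' 'are you happy' '--' 'I am hungry' 'are you' '--' 'are(you hungry' 'too' |
    grep_paragraphs
```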

Description(s)

For the following descriptions, the PCRE option (?x) is used to add explanatory comments (and spaces) inline with the actual working regex. If the comments (and most spaces) up to the next newline are removed, the resulting string is still the same regex. This allows the regex to be described in detail as working code, which makes maintenance a lot easier.

Option 1 regex (?x)--\nare(?:[^\n]*\n)+?(?=--|\Z)

(?x)   # match the remainder of the pattern with the following
       # effective flags: x
       #      x modifier: extended. Spaces and text after a # 
       #      in the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    #      Non-Capturing Group (?:[^\n]*\n)+?
[^\n]  #           matches non-newline characters
*      #           Quantifier — Matches between zero and unlimited times, as
       #           many times as possible, giving back as needed (greedy)
\n     #           matches a line-feed (newline) character (ASCII 10)
)      #      Close the Non-Capturing Group
+?     # Quantifier: matches between one and unlimited times, as
       # few times as possible, expanding as needed (lazy)
(?=    # Positive Lookahead (?=--|\Z)
       # Assert that the Regex below matches
       #      1st Alternative --
--     #           matches the characters -- literally (case sensitive)
|      #      2nd Alternative \Z
\Z     #           \Z asserts position at the end of the string, or before
       #           the line terminator right at the end of the 
       #           string (if any)
)      #      Closing the lookahead.

Option 2 regex (?sx)--\nare.*?(?=\n--|\Z)\n

(?sx)  # match the remainder of the pattern with the following eff. flags: sx
       #        s modifier: single line. Dot matches newline characters
       #        x modifier: extended. Spaces and text after a # in 
       #        the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
.*?    # matches any character 
       #        Quantifier — Matches between zero and unlimited times,
       #        as few times as possible, expanding as needed (lazy).
(?=    # Positive Lookahead (?=\n--|\Z)
       # Assert that the Regex below matches
       #        1st Alternative \n--
\n     #               matches a line-feed (newline) character (ASCII 10)
--     #               matches the characters -- literally.
|      #        2nd Alternative \Z
\Z     #               \Z asserts position at the end of the string, or
       #               before the line terminator right at
       #               the end of the string (if any)
)      # Close the lookahead parenthesis.
\n     #        matches a line-feed (newline) character (ASCII 10)
 

Option 3 regex (?xs)--\nare(?:(?!\n--).)*\n

(?xs)  # match the remainder of the pattern with the following eff. flags: xs
       # modifier x : extended. Spaces and text after a # in the pattern are ignored
       # modifier s : single line. Dot matches newline characters
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    # Non-capturing group (?:(?!\n--).)
(?!    #      Negative Lookahead (?!\n--)
       #           Assert that the Regex below does not match
\n     #                matches a line-feed (newline) character (ASCII 10)
--     #                matches the characters -- literally
)      #      Close Negative lookahead
.      #      matches any character
)      # Close the Non-Capturing group.
*      # Quantifier — Matches between zero and unlimited times, as many
       # times as possible, giving back as needed (greedy)
\n     # matches a line-feed (newline) character (ASCII 10)

SED

$ sed -nEe 'bend
            :start  ;N;/^--\nare/!b
            :loop   ;/^--$/!{p;n;bloop}
            :end    ;/^--$/bstart'           infile