Filter separated paragraphs according to their first word

sed text-processing

I have a program that prints lines of text grouped into "paragraphs", each introduced by a line containing only '--'. For example, it might print:

--
are you happy
--
I am hungry
are you
--
are(you hungry
too

I want to pipe that into another program (sed, maybe?) and get back only the paragraphs that start with a given word (e.g. "are"). For the input above, filtering on "are" would give:

--
are you happy
--
are(you hungry
too

The program prints out a potentially very large number of "paragraphs" but I expect only a small number to match, which is why I would prefer to be able to filter the program's output in a streaming way (avoiding writing everything to a huge file and then filtering it).

Best Answer

AWK

Using GNU awk or mawk:

$ awk '$1~"^"word{printf("--\n%s",$0)}' word='are' RS='--\n' infile
--
are you happy
--
are(you hungry
too

This sets the variable word to the word to match and RS (the record separator) to '--' followed by a newline (\n). A multi-character RS is an extension, which is why GNU awk or mawk is needed. For every record whose first field starts with that word ($1~"^"word), awk prints '--', a newline, and then the record exactly as found. Note that word is interpreted as a regular expression, so any regex metacharacters in it would need escaping.
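Where an awk without multi-character RS support is all that is available, the same filter can be approximated line by line. The following is only a sketch under a few assumptions: filter_paragraphs is a hypothetical helper name, the separator is a line containing exactly '--', and the word is compared as a literal prefix of the paragraph's first line (not as a regex on the first field, so leading whitespace is not skipped).

```shell
# filter_paragraphs is a hypothetical helper name, not part of the original answer.
# It keeps only the paragraphs whose first line starts with the literal string $1.
filter_paragraphs() {
  awk -v word="$1" '
    # A separator line starts a new paragraph; the decision is made on its first line.
    $0 == "--" { pending = 1; keep = 0; next }
    # First line of a paragraph: literal prefix test, no regex involved.
    pending {
      pending = 0
      keep = (index($0, word) == 1)
      if (keep) print "--"
    }
    # Print every line of a paragraph that was kept.
    keep
  '
}

# Sample run on the input from the question.
printf '%s\n' '--' 'are you happy' '--' 'I am hungry' 'are you' '--' 'are(you hungry' 'too' |
  filter_paragraphs are
```

Because it handles one line at a time, this version streams in the same way as the RS-based one and should print the same three-line-plus-two-line result for the sample input.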

GREP

Using grep (GNU grep, for the -z and -P options):

grep -Pz -- '--\nare(?:[^\n]*\n)+?(?=--|\Z)' infile
grep -Pz -- '(?s)--\nare.*?(?=\n--|\Z)\n' infile
grep -Pz -- '(?s)--\nare(?:(?!\n--).)*\n' infile

Description(s)

For the following descriptions, the PCRE option (?x) is used to add extensive explanatory comments (and spaces) inline with the actual, working regex. If the comments (each running up to the next newline) and most of the spaces are removed, what remains is still the same regex. This allows the regex to be described in detail within working code, which makes maintenance a lot easier.

Option 1 regex `(?x)--\nare(?:[^\n]*\n)+?(?=--|\Z)`

(?x)   # match the remainder of the pattern with the following
       # effective flags: x
       #      x modifier: extended. Spaces and text after a # 
       #      in the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    #      Non-Capturing Group (?:[^\n]*\n)+?
[^\n]  #           matches non-newline characters
*      #           Quantifier — Matches between zero and unlimited times, as
       #           many times as possible, giving back as needed (greedy)
\n     #           matches a line-feed (newline) character (ASCII 10)
)      #      Close the Non-Capturing Group
+?     # Quantifier — Matches between one and unlimited times, as
       # few times as possible, expanding as needed (lazy)
(?=    # Positive Lookahead (?=--|\Z)
       # Assert that the Regex below matches
       #      1st Alternative --
--     #           matches the characters -- literally (case sensitive)
|      #      2nd Alternative \Z
\Z     #           \Z asserts position at the end of the string, or before
       #           the line terminator right at the end of the 
       #           string (if any)
)      #      Closing the lookahead.

Option 2 regex `(?sx)--\nare.*?(?=\n--|\Z)\n`

(?sx)  # match the remainder of the pattern with the following eff. flags: sx
       #        s modifier: single line. Dot matches newline characters
       #        x modifier: extended. Spaces and text after a # in 
       #        the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
.*?    # matches any character 
       #        Quantifier — Matches between zero and unlimited times,
       #        as few times as possible, expanding as needed (lazy).
(?=    # Positive Lookahead (?=\n--|\Z)
       # Assert that the Regex below matches
       #        1st Alternative \n--
\n     #               matches a line-feed (newline) character (ASCII 10)
--     #               matches the characters -- literally.
|      #        2nd Alternative \Z
\Z     #               \Z asserts position at the end of the string, or
       #               before the line terminator right at
       #               the end of the string (if any)
)      # Close the lookahead parenthesis.
\n     #        matches a line-feed (newline) character (ASCII 10)

Option 3 regex `(?xs)--\nare(?:(?!\n--).)*\n`

(?xs)  # match the remainder of the pattern with the following eff. flags: xs
       # modifier x : extended. Spaces and text after a # in the pattern are ignored
       # modifier s : single line. Dot matches newline characters
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    # Non-capturing group (?:(?!\n--).)
(?!    #      Negative Lookahead (?!\n--)
       #           Assert that the Regex below does not match
\n     #                matches a line-feed (newline) character (ASCII 10)
--     #                matches the characters -- literally
)      #      Close Negative lookahead
.      #      matches any character
)      # Close the Non-Capturing group.
*      # Quantifier — Matches between zero and unlimited times, as many
       # times as possible, giving back as needed (greedy)
\n     # matches a line-feed (newline) character (ASCII 10)

sed

$ sed -nEe 'bend
            :start  ;N;/^--\nare/!b
            :loop   ;/^--$/!{p;n;bloop}
            :end    ;/^--$/bstart'           infile

Related Solutions

Grouped sorting of continuous paragraphs (separated by blank line)

Drav's awk solution is good, but that means running one sort command per paragraph. To avoid that, you could do:

< file awk -v n=0 '!NF{n++};{print n,$0}' | sort -k1n -k2 | cut -d' ' -f2-

Or you could do the whole thing in perl:

perl -ne 'if (/\S/){push@l,$_}else{print sort@l if@l;@l=();print}
          END{print sort @l if @l}' < file

Note that in the commands above, the separators are blank lines rather than strictly empty lines: for the awk one, lines containing only space or tab characters; for the perl one, lines containing only horizontal or vertical spacing characters. If you want only truly empty lines to act as separators, replace !NF with !length or $0=="", and /\S/ with /./.
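As an illustration of the awk-based pipeline above, here is a small sample run. It is a sketch only: sort_paragraphs is a hypothetical wrapper name, and LC_ALL=C is added as an assumption so that the sort order does not depend on the locale.

```shell
# sort_paragraphs is a hypothetical wrapper around the pipeline shown above.
sort_paragraphs() {
  # 1. Prefix each line with its paragraph number (bumped on every blank line).
  # 2. Sort by paragraph number first, then by the line text.
  # 3. Strip the paragraph number again.
  awk -v n=0 '!NF{n++};{print n,$0}' | LC_ALL=C sort -k1n -k2 | cut -d' ' -f2-
}

# Two paragraphs separated by one empty line.
printf '%s\n' pear apple '' zebra mango | sort_paragraphs
```

With this input the two paragraphs stay in their original order, and only the lines within each one are sorted: apple and pear, then the empty separator line, then mango and zebra.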

Sed – Modify Every Non-First Word Repetition in Text

If your input doesn't contain any <, > or + characters, you could do:

sed '
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g'

If it may contain them, you can escape them first and restore them afterwards:

sed '
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'

Those assume you want to do that independently on each line. If you want to do it on the whole file, you'd need to load the whole file in memory first (note that some sed implementations have size limitations there):

sed '
  :2
  $!{N;b2
  }
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'

That's going to be pretty inefficient though and would be a lot easier with perl:

perl -pe 's/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

Line-based:

perl -pe 'my %seen;s/\w+/$seen{$&}++ ? "+$&+" : $&/ge'
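To make the effect of the per-line marking concrete, the first sed command above can be wrapped in a function and run on a small sample. This is only a sketch: mark_repeats is a hypothetical name, and it assumes the input contains no <, > or + characters.

```shell
# mark_repeats is a hypothetical wrapper around the first sed command above.
# On each line, every repeat of an earlier word is surrounded with + signs.
mark_repeats() {
  sed '
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g'
}

# Each line is handled independently, so "the" on the second line is not marked.
printf '%s\n' 'the cat saw the other cat' 'the end' | mark_repeats
```

For this sample, the first line should come out as "the cat saw +the+ other +cat+" and the second line should be left unchanged.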
