Grouped sorting of continuous paragraphs (separated by blank line)

sort text-processing

I think I'm pretty experienced now in sorting by columns; however, I haven't found anything so far on how to sort groups of contiguous rows.

Suppose we have a text file that looks like this (very simplified, of course):

Echo
Alpha
Delta
Charlie

Golf
Bravo
Hotel
Foxtrot

Now, is it possible to sort the lines alphanumerically within each block separately?
I mean, so that the result looks like this:

Alpha
Charlie
Delta
Echo

Bravo
Foxtrot
Golf
Hotel

Judging from what I found in the sort man page, this might not be possible with the built-in UNIX sort command alone. Can it be done at all without resorting to external/third-party tools?

Best Answer

Drav's awk solution is good, but that means running one sort command per paragraph. To avoid that, you could do:

< file awk -v n=0 '!NF{n++};{print n,$0}' | sort -k1n -k2 | cut -d' ' -f2-

Or you could do the whole thing in perl:

perl -ne 'if (/\S/){push@l,$_}else{print sort@l if@l;@l=();print}
          END{print sort @l if @l}' < file

Note that in both commands above, the separators are blank lines rather than strictly empty lines: for the awk one, any line containing only space or tab characters; for the perl one, any line containing only whitespace (horizontal or vertical spacing characters). If you want only truly empty lines to act as separators, replace !NF with !length or $0=="", and /\S/ with /./.
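To make the awk | sort | cut pipeline easier to follow, here is a small sketch that runs it on the sample from the question and also shows the intermediate "decorated" stream. The sample helper function and the LC_ALL=C setting are my additions (the latter only to make the ordering reproducible); the pipeline itself is the one from the answer.

```shell
# Hypothetical helper: emits the two-paragraph sample from the question.
sample() { printf '%s\n' Echo Alpha Delta Charlie '' Golf Bravo Hotel Foxtrot; }

# Step 1: prefix every line with its paragraph number; the counter is
# bumped on each blank line, so the blank line belongs to the next group.
decorated=$(sample | awk -v n=0 '!NF{n++};{print n,$0}')

# Step 2: sort numerically on the paragraph number, then on the text,
# and finally cut the number off again.
sorted=$(printf '%s\n' "$decorated" | LC_ALL=C sort -k1n -k2 | cut -d' ' -f2-)

printf '%s\n' "$sorted"
```

The decorated stream starts with "0 Echo", "0 Alpha", and so on, and the separator becomes "1 " (number, space, nothing), which sorts ahead of "1 Bravo" and so stays at the top of the second group. The final output is the desired result from the question: Alpha, Charlie, Delta, Echo, an empty line, then Bravo, Foxtrot, Golf, Hotel.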

Related solution: Filter separated paragraphs according to their first word

AWK

Using GNU awk or mawk:

$ awk '$1~"^"word{printf("--\n%s",$0)}' word='are' RS='--\n' infile
--
are you happy
--
are(you hungry
too

This sets the variable word to the word that must appear at the start of a record, and RS (the record separator) to '--' followed by a newline (\n). Then, for every record whose first field starts with that word ($1~"^"word), it prints the record exactly as found, prefixed with '--' and a newline. A multi-character RS is not specified by POSIX, which is why GNU awk or mawk is needed.
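If neither GNU awk nor mawk is available, the same filtering can be approximated line by line without touching RS. This is only a sketch under a few assumptions: the sample input is made up to be consistent with the output shown above, the variable names pending and keep are mine, and the test is a literal prefix match on the first line of each paragraph (via index) rather than a regex match on the first field, so leading whitespace and regex metacharacters in word are treated differently from the original command.

```shell
# Hypothetical sample input: paragraphs introduced by a "--" line.
sample() {
cat <<'EOF'
--
are you happy
--
I am hungry
are you
--
are(you hungry
too
EOF
}

filtered=$(sample | awk -v word='are' '
  $0 == "--" { pending = 1; keep = 0; next }   # separator: decide on the next line
  pending    { pending = 0                     # first line of a paragraph
               keep = (index($0, word) == 1)   # literal prefix test, no regex
               if (keep) print "--" }
  keep                                         # print lines of kept paragraphs
')

printf '%s\n' "$filtered"
```

With this input, the "I am hungry" paragraph is dropped even though its second line starts with "are", because only the first line of each paragraph is tested.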

GREP

Using GNU grep (needed for the -z and -P options):

grep -Pz -- '--\nare(?:[^\n]*\n)+?(?=--|\Z)' infile
grep -Pz -- '(?s)--\nare.*?(?=\n--|\Z)\n' infile
grep -Pz -- '(?s)--\nare(?:(?!\n--).)*\n' infile

Description(s)

For the following descriptions, the PCRE option (?x) is used to add extensive explanatory comments (and spaces) inline with the actual, working regex. If the comments and most of the spaces (up to the next newline) are removed, what remains is still the same regex. This allows the regex to be described in detail within working code, which makes code maintenance a lot easier.

Option 1 regex `(?x)--\nare(?:[^\n]*\n)+?(?=--|\Z)`

(?x)   # match the remainder of the pattern with the following
       # effective flags: x
       #      x modifier: extended. Spaces and text after a # 
       #      in the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    #      Non-Capturing Group (?:[^\n]*\n)+?
[^\n]  #           matches non-newline characters
*      #           Quantifier — Matches between zero and unlimited times, as
       #           many times as possible, giving back as needed (greedy)
\n     #           matches a line-feed (newline) character (ASCII 10)
)      #      Close the Non-Capturing Group
+?     # Quantifier: matches between one and unlimited times, as
       # few times as possible, expanding as needed (lazy)
(?=    # Positive Lookahead (?=--|\Z)
       # Assert that the Regex below matches
       #      1st Alternative --
--     #           matches the characters -- literally (case sensitive)
|      #      2nd Alternative \Z
\Z     #           \Z asserts position at the end of the string, or before
       #           the line terminator right at the end of the 
       #           string (if any)
)      #      Closing the lookahead.

Option 2 regex `(?sx)--\nare.*?(?=\n--|\Z)\n`

(?sx)  # match the remainder of the pattern with the following eff. flags: sx
       #        s modifier: single line. Dot matches newline characters
       #        x modifier: extended. Spaces and text after a # in 
       #        the pattern are ignored
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
.*?    # matches any character 
       #        Quantifier — Matches between zero and unlimited times,
       #        as few times as possible, expanding as needed (lazy).
(?=    # Positive Lookahead (?=\n--|\Z)
       # Assert that the Regex below matches
       #        1st Alternative \n--
\n     #               matches a line-feed (newline) character (ASCII 10)
--     #               matches the characters -- literally.
|      #        2nd Alternative \Z
\Z     #               \Z asserts position at the end of the string, or
       #               before the line terminator right at
       #               the end of the string (if any)
)      # Close the lookahead parenthesis.
\n     #        matches a line-feed (newline) character (ASCII 10)

Option 3 regex `(?xs)--\nare(?:(?!\n--).)*\n`

(?xs)  # match the remainder of the pattern with the following eff. flags: xs
       # modifier x : extended. Spaces and text after a # in the pattern are ignored
       # modifier s : single line. Dot matches newline characters
--     # matches the characters -- literally (case sensitive)
\n     # matches a line-feed (newline) character (ASCII 10)
are    # matches the characters are literally (case sensitive)
(?:    # Non-capturing group (?:(?!\n--).)
(?!    #      Negative Lookahead (?!\n--)
       #           Assert that the Regex below does not match
\n     #                matches a line-feed (newline) character (ASCII 10)
--     #                matches the characters -- literally
)      #      Close Negative lookahead
.      #      matches any character
)      # Close the Non-Capturing group.
*      # Quantifier — Matches between zero and unlimited times, as many
       # times as possible, giving back as needed (greedy)
\n     # matches a line-feed (newline) character (ASCII 10)

SED

$ sed -nEe 'bend
            :start  ;N;/^--\nare/!b
            :loop   ;/^--$/!{p;n;bloop}
            :end    ;/^--$/bstart'           infile
