Text Processing – Concatenate Lines Based on First Character of Next Line

awk, perl, sed, text processing

I am looking for a way to concatenate lines based on the first character of the next line. So far, the only way I see is to write a shell script that reads the file line by line and does something along these lines:

while read line
    if $line does not start with "," and $curr_line is empty 
        store line in curr_line
    if $line does not start with "," and $curr_line is not empty
        flush $curr_line to file
        store $line in $curr_line
    if $line starts with ","
        append $line to $curr_line, flush $curr_line to file, empty $curr_line
done < file
flush $curr_line to file if it is not empty

So I am trying to understand whether this could be achieved with sed, or even grep with redirection.
The rules of the file are simple:
A line starting with "," must be appended to the previous line, and there is never more than one such line in a row.

ex:

line0
line1
line2
,line3
line4
line5
,line6
line7
,line8
line9
line10
line11

The result file would be

line0
line1
line2,line3
line4
line5,line6
line7,line8
line9
line10
line11

Best Answer

I'd do:

awk -v ORS= '
  NR>1 && !/^,/ {print "\n"}
  {print}
  END {if (NR) print "\n"}' < file

That is, print the newline character that terminates the previous line only if the current line does not start with a ,.

In any case, I wouldn't use a while read loop.
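To see the approach on the sample data from the question, here is a minimal sketch. The function name join_comma_lines is a hypothetical wrapper introduced for illustration, and the regex is anchored with ^ so that only a leading comma counts as a continuation.

```shell
# join_comma_lines: hypothetical wrapper name, not part of the original answer.
join_comma_lines() {
  awk -v ORS= '
    NR>1 && !/^,/ {print "\n"}   # end the previous line unless this one continues it
    {print}                      # print the current line with no newline (ORS is empty)
    END {if (NR) print "\n"}     # terminate the last line if there was any input
  '
}

# Sample input taken from the question.
printf '%s\n' line0 line1 line2 ,line3 line4 line5 ,line6 | join_comma_lines
```

Because the newline is emitted lazily, the script never needs to buffer more than the current line.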

Related Solutions

Merge Next Line with previous line

Keep it simple:

sed 'H;1h;$!d;g;s/\n  */ /g'

This short script will join all lines that begin with at least one space with the previous line.

How it works: H appends each line to the hold space. To avoid a leading newline, the first line is copied instead by 1h. If the current line is not the last one, it is deleted ($!d); otherwise g copies the hold space into the pattern space. At that point the whole file is in the pattern space, and the s command replaces each newline followed by one or more spaces with a single space.
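The steps above can be tried on a small input. This is only a sketch; join_indented is a hypothetical function name and the three input lines are made up for illustration.

```shell
# join_indented: hypothetical wrapper name used only for this sketch.
join_indented() {
  # H    append the current line to the hold space
  # 1h   on line 1, overwrite the hold space instead (avoids a leading newline)
  # $!d  on every line but the last, discard the pattern space and read on
  # g    on the last line, copy the hold space into the pattern space
  # s    replace each newline followed by one or more spaces with one space
  sed 'H;1h;$!d;g;s/\n  */ /g'
}

printf '%s\n' 'first' '   continued' 'second' | join_indented
```

The indented line is folded into the line before it, while the unindented line stays on its own.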

With GNU sed you can make it even simpler:

sed -z 's/\n  */ /g'

Bash – Elegant Way to Merge Lines with Multi-Char Delimiter

The elegance may come from choosing the right regex. Instead of changing every \r to a \n (s/\r/\n/g), you can convert every line terminator (\r\n, \r, or \n) to the delimiter you want. This assumes GNU sed, since few other sed implementations understand \r, and not all understand -E:

sed -E 's/\r\n|\r|\n/; /g'

Or, if you want to remove empty lines, any run of such line terminators:

sed -E 's/[\r\n]+/; /g'

That works only if all line terminators are in the pattern space, which means slurping the whole file into memory so they can be edited.

So, with GNU sed, you can use this simpler single command:

sed -zE 's/[\r\n]+/; /g; s/; $/\n/' "$filepathvar"

The -z option makes sed treat NUL bytes as line terminators, which effectively puts every \r and \n in the pattern space.

The s/[\r\n]+/; /g converts all types of line delimiters to the string you want.

The s/; $/\n/ converts the (last) trailing delimiter to an actual newline.
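As a quick check of the two substitutions, here is a sketch that reads from standard input instead of "$filepathvar". It assumes GNU sed (for -z, -E, and \r inside a bracket expression); join_terminators is a hypothetical function name.

```shell
# join_terminators: hypothetical wrapper name; assumes GNU sed.
join_terminators() {
  # 1st s: every run of CR/LF characters becomes "; "
  # 2nd s: the trailing "; " becomes a real newline
  sed -zE 's/[\r\n]+/; /g; s/; $/\n/'
}

# Mixed terminators: CRLF, LF, an empty line, then a final LF.
printf 'a\r\nb\n\nc\n' | join_terminators
```

The empty line disappears because a run of terminators is replaced as a single unit.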

Notes

The -z option tells sed to use the NUL byte (0x00) as the line delimiter. That delimiter originated with find, which needed a way to emit filenames containing newlines (-print0) in a form that xargs (-0) could consume. As a result, other tools were also modified to process NUL-delimited strings.

It is a non-POSIX option that splits the input at NUL bytes instead of newlines.

POSIX text files must not contain NUL bytes, so in practice this option captures the whole file in memory before processing it.

Splitting on NULs means that newline characters end up being editable in the pattern space of sed. If the file happens to contain some NUL bytes, the idea still works for newlines, as they remain editable within each chunk of the file.

The -z option was added by GNU sed. The AT&T sed (on which POSIX was based) did not have such an option (and still doesn't), and some BSD seds still don't either.

An alternative to the -z option is to capture the whole file in memory explicitly. That can be done in a POSIX-compatible way, for example:

sed 'H;1h;$!d'          # capture whole file in hold space.
sed ':a;N;$!ba'         # capture whole file in pattern space.

Having all newlines (except the last one) in the pattern space makes it possible to edit them:

sed -Ee 'H;1h;$!d;x' -e 's/(\r\n|\r|\n)/; /g'

With older seds it is also necessary to use the longer and more explicit (\r\n|\r|\n)+ instead of [\r\n]+, because such seds don't understand \r or \n inside bracket expressions [].
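The pattern-space variant of the slurp (:a;N;$!ba) can be sketched the same way. The example below handles only plain \n terminators to stay clear of the \r portability issue, splits the script into separate -e fragments so the label is not followed by a semicolon, and assumes input with at least two lines. The name slurp_and_join is hypothetical.

```shell
# slurp_and_join: hypothetical helper name used only for this sketch.
slurp_and_join() {
  # :a    define label a
  # N     append the next input line to the pattern space
  # $!ba  if this is not the last line, branch back to a
  # s     with every line collected, replace each newline with "; "
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/; /g'
}

printf '%s\n' alpha beta gamma | slurp_and_join
```

The final newline is never part of the pattern space, so no trailing delimiter needs to be cleaned up.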

Line oriented

A solution that works one record at a time is possible with GNU awk. It does not need to keep the whole file in memory, and it also accepts a lone \r as a line terminator:

awk -v RS='[\r\n]+' 'NR>1{printf "; "}{printf "%s", $0}END{print ""}' file

This requires an awk that treats a multi-character RS as a regular expression, such as GNU awk. POSIX only specifies the behavior when RS is a single character.
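A small sketch of the record-oriented approach, reading from standard input. It assumes the installed awk treats a multi-character RS as a regular expression, as GNU awk does. The function name join_records is hypothetical, and the sample record containing a % sign is there to show why the record is passed through "%s" rather than used as the format string.

```shell
# join_records: hypothetical helper name; assumes an awk with regex RS support.
join_records() {
  awk -v RS='[\r\n]+' '
    NR > 1 { printf "; " }   # separator before every record but the first
    { printf "%s", $0 }      # print the record as data, never as a format string
    END { print "" }         # finish with a single newline
  '
}

# Mixed terminators again: CRLF, LF, an empty line, then a final LF.
printf '100%% done\r\nnext\n\nlast\n' | join_records
```

Only one record is held at a time, so memory use stays flat regardless of file size.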