Remove every line breaks following carriage return (^M) and join the lines

bashcsvsed

edit: Now that I got answers, I marked one by @KamilMaciorowski that fits better to the title as an answer, but this answer by @oliv actually were better suited to my actual need to my primary purpose. (To process csv file with breaks consistently on awk.)

So in the case if you were looking for awking in the similar circumstance, I recommend checking that first!


Please help me preparing a few thousands of csv file ready for awk to process! Some of the field has line breaks inside the field and that's causing awk to process them as a multiple records.
However those problematic line breaks only happens where ^M is inserted, so I just need to remove ^M and line-break altogether from all of them.

*These ^Ms are indeed line break character, not literal caret & letter M string. This file is generated for .net to parse and process, but I haven't worked on developing apps on neither file producing/reading sides, so I don't really know how it's successfully parsed. It's exclusively used for fields in certain columns with multiple-lined strings (comments).

So how do you make this (csv with 1 header and 2 records. Some field has line breaks in it preceeded by ^M):

"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a^M
2-2_b^M
2-2_c", "2-3"

like this? (csv with 1 header and 2 records without line breaks within each of them.):

"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a2-2_b2-2_c", "2-3"

I tried removing them with sed but I heard there's no way to process, and I didn't quite get the reason why.

for file in *.csv; do
    sed -e "s/^M//" $file > sedded/$file;
done

Anyhow, I get this:

"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a
2-2_b
2-2_c", "2-3"

I tried to go for something like "s/^M\n/", and it doesn't work as I suspected. Should I use completely different tool like vim? As long as it works for thousands of files at once (each containing ~500 lines, and I don't really care the time it takes to process) I'm fine with any sort of resolution. Just thought sed was the way. (I'm ok to use DOS command/powershell if it's easier or more straight forward!)

Best Answer

If these ^M-s are indeed line break characters, not literal caret & letter M strings, then they are what we denote \r, CR or 0x0d (compare this answer of mine, the beginning of it).

Your command

sed -e "s/^M//"

doesn't remove \r; it doesn't even remove literal ^M. The command means "take a line, search for a letter M that is at the very beginning of the line (^, see this), replace it with nothing.

Note sed understands \r. Still sed -e 's/\r//' is not exactly what you need. It removes \r but you need to remove the following \n as well. You may want to try sed -e 's/\r\n//', this will also fail. The problem is sed is a text tool and it treats \n as a separator. Excerpt from info sed (emphasis mine):

sed operates by performing the following cycle on each lines of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; […].

This means normally \n doesn't belong to any string processed with s/… (or another sed command). For this reason concatenating few lines is not easy. Still it can be done. This is the command you need:

sed -e ': start; /\r$/{ s/\r$//; N; s/\n// }; /\r$/b start'

Explanation:

  • : start is a label.
  • If the line contains \r (i.e. ^M, 0x0d character) at the very end ($), execute the {} block which is:
    • replace \r at the very end with nothing,
    • append an additional line from the input (N),
    • replace \n that separates the additional line from the previous data.
  • If the outcome contains \r at the very end (meaning the additional line brought it, so we need to add yet another line), jump to start.
Related Question