edit: Now that I got answers, I marked one by @KamilMaciorowski that fits better to the title as an answer, but this answer by @oliv actually were better suited to my actual need to my primary purpose. (To process csv file with breaks consistently on awk.)
So in the case if you were looking for awking in the similar circumstance, I recommend checking that first!
Please help me preparing a few thousands of csv file ready for awk
to process! Some of the field has line breaks inside the field and that's causing awk
to process them as a multiple records.
However those problematic line breaks only happens where ^M is inserted, so I just need to remove ^M and line-break altogether from all of them.
*These ^M
s are indeed line break character, not literal caret & letter M string. This file is generated for .net to parse and process, but I haven't worked on developing apps on neither file producing/reading sides, so I don't really know how it's successfully parsed. It's exclusively used for fields in certain columns with multiple-lined strings (comments).
So how do you make this (csv with 1 header and 2 records. Some field has line breaks in it preceeded by ^M):
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a^M
2-2_b^M
2-2_c", "2-3"
like this? (csv with 1 header and 2 records without line breaks within each of them.):
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a2-2_b2-2_c", "2-3"
I tried removing them with sed
but I heard there's no way to process, and I didn't quite get the reason why.
for file in *.csv; do
sed -e "s/^M//" $file > sedded/$file;
done
Anyhow, I get this:
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a
2-2_b
2-2_c", "2-3"
I tried to go for something like "s/^M\n/"
, and it doesn't work as I suspected. Should I use completely different tool like vim
? As long as it works for thousands of files at once (each containing ~500 lines, and I don't really care the time it takes to process) I'm fine with any sort of resolution. Just thought sed
was the way. (I'm ok to use DOS command/powershell if it's easier or more straight forward!)
Best Answer
If these
^M
-s are indeed line break characters, not literal caret & letter M strings, then they are what we denote\r
,CR
or0x0d
(compare this answer of mine, the beginning of it).Your command
doesn't remove
\r
; it doesn't even remove literal^M
. The command means "take a line, search for a letterM
that is at the very beginning of the line (^
, see this), replace it with nothing.Note
sed
understands\r
. Stillsed -e 's/\r//'
is not exactly what you need. It removes\r
but you need to remove the following\n
as well. You may want to trysed -e 's/\r\n//'
, this will also fail. The problem issed
is a text tool and it treats\n
as a separator. Excerpt frominfo sed
(emphasis mine):This means normally
\n
doesn't belong to any string processed withs/…
(or anothersed
command). For this reason concatenating few lines is not easy. Still it can be done. This is the command you need:Explanation:
: start
is a label.\r
(i.e.^M
,0x0d
character) at the very end ($
), execute the{}
block which is:\r
at the very end with nothing,N
),\n
that separates the additional line from the previous data.\r
at the very end (meaning the additional line brought it, so we need to add yet another line), jump tostart
.