Bash – Unix file with newlines within quotes

bashcsvtext processing

I've got a CSV weird file with quotes within quotes and newlines and what not in one single column. Now I need to identify that column with "newlines" as one column and replace newlines with some delimiter.

I have 3 columns, the 3rd column will have some HTML text with all double quotes and each and every special character. But the double quotes are escaped with double quotes, like "<This ""is"" string>".

Input:

ID, Name, text

"1","abc","Line 1"
"2","def","Line2
""line2"",line2"
"3","ghi","line3"

Output:

ID, Name, text
"1","abc","Line 1"
"2","def","Line2 ""line2"",line2"
"3","ghi","line3"

Best Answer

There's no real issue with your file. It has embedded newlines and double quotes. A CSV parser would be able to handle it properly. Escaping double quotes with " (while double quoting the field) is the proper way to escape embedded double quotes in a CSV file.

To replace the embedded newlines in your CSV file with a @ character, you could do this:

$ csvformat -M '@' file.csv | tr '\n@' '@\n'
1,abc,Line 1
2,def,"Line2@""line2"",line2"
3,ghi,line3

This uses csvformat from the csvkit toolbox. It's a proper CSV parser that is able to reformat CSV files.

The command pipeline above first replaces all newlines that are not embedded with the @ character. Then I use tr to swap the remaining newlines and the @ characters with each other, ending up with a CSV file whose embedded newlines are @.

This relies on the fact that the original data in the file contains no @ characters.

If you then want to have spaces instead of a marker of where the newlines originally were, then use tr '\n@' ' \n' instead of the tr shown above:

$ csvformat -M '@' file.csv | tr '\n@' ' \n'
1,abc,Line 1
2,def,"Line2 ""line2"",line2"
3,ghi,line3

Note that this would make it extremely difficult to re-insert the original newlines if there are other spaces in the data (as there is in the third field on the first line).

Would you prefer that csvformat did not remove all unnecessary double quotes, then use it with -U 1:

$ csvformat -U 1 -M '@' file.csv | tr '\n@' ' \n'
"1","abc","Line 1"
"2","def","Line2 ""line2"",line2"
"3","ghi","line3"

Related Solutions

Linux – How to Prefix a column values with an apostrophe ( ‘ )

sed:

sed '2,$s/^\("[^"]*","\)/\1'"'"/ test.in

Using EREs to get rid of some of the escaping:

sed -E '2,$s/^("[^"]*",")/\1'"'"/ test.in

awk:

awk -F, 'NR>1{sub(/^"/,"\"'"'"'",$2)}1' test.in

If you don't want to worry about the quoting, use the escape code:

awk -F, '{sub(/^"/,"\"\x27",$2)}1' test.in

How to merge first two lines of a csv column-by-column

Try this

$ awk -F, 'NR<2{split(gensub(/Citty/,"City","g",$0),a,FS)}NR==2{for(b=2;b<=NF;b+=2){c=c a[b]" "$b","}print gensub(/,$/,"",1,c)}NR>2{print gensub(/(^,|" *",)/,"","g",$0)}' inp
Product Name,City Location,Price Per Unit
banana,CA,5.7
apple,FL,2.3
$

Same code is more readable if split across a few lines :

$ awk -F, '
> NR<2{split(gensub(/Citty/,"City","g",$0),a,FS)}
> NR==2{for(b=2;b<=NF;b+=2){c=c a[b]" "$b","}print gensub(/,$/,"",1,c)}
> NR>2{print gensub(/(^,|" *",)/,"","g",$0)}' inp
Product Name,City Location,Price Per Unit
banana,CA,5.7
apple,FL,2.3
$

If 1st line, split the line into array elements within a. Fix the Citty->City typo.

If 2nd line, starting with the 2nd column, print the corresponding column from 1st line together with this column. Repeat for each column, going in 2 column increments. Strip the trailing ,.

After 2nd line, replace any leading , or any "<spaces>", with an empty string and then print the result.

Tested ok on GNU Awk 4.0.2

Try it online!

Best Answer

Related Solutions

Linux – How to Prefix a column values with an apostrophe ( ‘ )

How to merge first two lines of a csv column-by-column

Related Question