Bash – Unix file with newlines within quotes

bashcsvtext processing

I've got a CSV weird file with quotes within quotes and newlines and what not in one single column. Now I need to identify that column with "newlines" as one column and replace newlines with some delimiter.

I have 3 columns, the 3rd column will have some HTML text with all double quotes and each and every special character. But the double quotes are escaped with double quotes, like "<This ""is"" string>".

Input:

ID, Name, text

"1","abc","Line 1"
"2","def","Line2
""line2"",line2"
"3","ghi","line3"

Output:

ID, Name, text
"1","abc","Line 1"
"2","def","Line2 ""line2"",line2"
"3","ghi","line3"

Best Answer

There's no real issue with your file. It has embedded newlines and double quotes. A CSV parser would be able to handle it properly. Escaping double quotes with " (while double quoting the field) is the proper way to escape embedded double quotes in a CSV file.

To replace the embedded newlines in your CSV file with a @ character, you could do this:

$ csvformat -M '@' file.csv | tr '\n@' '@\n'
1,abc,Line 1
2,def,"Line2@""line2"",line2"
3,ghi,line3

This uses csvformat from the csvkit toolbox. It's a proper CSV parser that is able to reformat CSV files.

The command pipeline above first replaces all newlines that are not embedded with the @ character. Then I use tr to swap the remaining newlines and the @ characters with each other, ending up with a CSV file whose embedded newlines are @.

This relies on the fact that the original data in the file contains no @ characters.

If you then want to have spaces instead of a marker of where the newlines originally were, then use tr '\n@' ' \n' instead of the tr shown above:

$ csvformat -M '@' file.csv | tr '\n@' ' \n'
1,abc,Line 1
2,def,"Line2 ""line2"",line2"
3,ghi,line3

Note that this would make it extremely difficult to re-insert the original newlines if there are other spaces in the data (as there is in the third field on the first line).

Would you prefer that csvformat did not remove all unnecessary double quotes, then use it with -U 1:

$ csvformat -U 1 -M '@' file.csv | tr '\n@' ' \n'
"1","abc","Line 1"
"2","def","Line2 ""line2"",line2"
"3","ghi","line3"
Related Question