Remove double quotes within the double-quoted field values in .dat file

awkcsvsedtext processing

I have a text file that has around 15 columns. 
The fields are separated by comma. 
One column that is description is double-quoted
and also has some words which are double-quoted. 
I need to retain the beginning and ending double quotes
and remove only the inner double quotes.

Something like this:

"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"

I need output as:

"Hi there, we are from XYZ team, we have an Opportunity at our organization"

I don't want to go for Python programming. 
I was looking for an awk command or any other best option.

The file might have 100 lines of data but this description column has double-quoted word for few lines and not for all 100 lines.

Here is some sample data:

invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,279.6,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

I have to remove double quotes for "Material Included" in the line description.

Please Note: I need entire file and retain all the columns, but just remove inner double quotes in the line description value. Only the line description field has such inner double quoted values.
As for now, there is only one inner double quoted word that is coming up for line description the file, we haven't noticed more than one.

Best Answer

Note: I'm not using the provided data from the question as the number of header fields does not seem to match up with the number of data fields. Instead I use printf to create a simple data set with the same quoting issue as described in the question.

Using Miller (mlr) as shown below, you will be able to convert the problematic embedded double quotes into properly CSV-encoded embedded double quotes. This includes doubling up each embedded double-quote character:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"'
aaa,"bb "bb" bb","c"cc"
$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes cat
aaa,"bb ""bb"" bb","c""cc"

This would create a CSV document that any CSV-aware parser would be able to read correctly, preserving the embedded quotes.

To completely remove the embedded double quotes, you may use Miller like so:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
aaa,bb bb bb,ccc

This uses mlr to iterate over all fields in all records and to remove any double-quote character found.

If a field needs quoting due to containing a comma, then Miller will quote it:

$ printf '%s\n' a,b,c 'aaa,"b,b "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
aaa,"b,b bb bb",ccc

The Miller command again, but by itself:

mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'

If you know the name of the field that contains the quotes that you wish to remove, e.g. line description, then you may simplify the command and remove the loop:

mlr --csv --lazy-quotes put '$["line description"] = gssub($["line description"], "\"", "")'
Related Question