Remove double quotes within the double-quoted field values in .dat file

awkcsvsedtext processing

I have a text file that has around 15 columns.
The fields are separated by comma.
One column that is description is double-quoted
and also has some words which are double-quoted.
I need to retain the beginning and ending double quotes
and remove only the inner double quotes.

Something like this:

"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"

I need output as:

"Hi there, we are from XYZ team, we have an Opportunity at our organization"

I don't want to go for Python programming.
I was looking for an awk command or any other best option.

The file might have 100 lines of data but this description column has double-quoted word for few lines and not for all 100 lines.

Here is some sample data:

invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,279.6,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

I have to remove double quotes for "Material Included" in the line description.

Please Note: I need entire file and retain all the columns, but just remove inner double quotes in the line description value. Only the line description field has such inner double quoted values.
As for now, there is only one inner double quoted word that is coming up for line description the file, we haven't noticed more than one.

Best Answer

Note: I'm not using the provided data from the question as the number of header fields does not seem to match up with the number of data fields. Instead I use printf to create a simple data set with the same quoting issue as described in the question.

Using Miller (mlr) as shown below, you will be able to convert the problematic embedded double quotes into properly CSV-encoded embedded double quotes. This includes doubling up each embedded double-quote character:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"'
a,b,c
aaa,"bb "bb" bb","c"cc"
$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes cat
a,b,c
aaa,"bb ""bb"" bb","c""cc"

This would create a CSV document that any CSV-aware parser would be able to read correctly, preserving the embedded quotes.

To completely remove the embedded double quotes, you may use Miller like so:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,bb bb bb,ccc

This uses mlr to iterate over all fields in all records and to remove any double-quote character found.

If a field needs quoting due to containing a comma, then Miller will quote it:

$ printf '%s\n' a,b,c 'aaa,"b,b "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,"b,b bb bb",ccc

The Miller command again, but by itself:

mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'

If you know the name of the field that contains the quotes that you wish to remove, e.g. line description, then you may simplify the command and remove the loop:

mlr --csv --lazy-quotes put '$["line description"] = gssub($["line description"], "\"", "")'

Explanation

:a : define a label called a.
s/(("[0-9,]*",?)*"[0-9,]*),/\1/ : This one needs to be broken down
- First of all, using this construct : (foo(bar)), \1 will be foobar and \2 will be bar.
- "[0-9,]*",? : match 0 or more of 0-9 or ,, followed by 0 or 1 ,.
- ("[0-9,]*",?)* : match 0 or more of the above.
- "[0-9,]* : match 0 or more of 0-9 or , that come right after a "
ta; : go back to the label a and run again if the substitution was successful.
s/""/","/g; : post-processing. Replace "" with ",".
s/"([0-9]*)",?/\1,/g : remove all quotes around numbers.

This might be easier to understand with another example:

$ echo '"1,2,3,4"' | sed -nr ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;p;ta;'
"1,2,34"
"1,234"
"1234"
"1234"

So, while you can find a number that is right after a quote and followed by a comma and another number, join the two numbers together and repeat the process until it is no longer possible.

At this point I believe it is useful to mention a quote from info sed that appears in the section describing advanced functions such as the label used above (thanks for finding if @Braiam):

In most cases, use of these commands indicates that you are probably better off programming in something like `awk' or Perl.

Best Answer

Related Solutions

Text Processing – How to Remove Commas Embedded Within Quotes in a CSV File

CSV Text Processing – How to Remove Commas Within Double-Quoted Numbers

Explanation

Related Question