I have a text file that has around 15 columns.
The fields are separated by comma.
One column that is description is double-quoted
and also has some words which are double-quoted.
I need to retain the beginning and ending double quotes
and remove only the inner double quotes.
Something like this:
"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"
I need output as:
"Hi there, we are from XYZ team, we have an Opportunity at our organization"
I don't want to go for Python programming.
I was looking for an awk command or any other best option.
The file might have 100 lines of data but this description column has double-quoted word for few lines and not for all 100 lines.
Here is some sample data:
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,279.6,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included" and artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
I have to remove double quotes for "Material Included" in the line description.
Please Note: I need entire file and retain all the columns, but just remove inner double quotes in the line description value. Only the line description field has such inner double quoted values.
As for now, there is only one inner double quoted word that is coming up for line description the file, we haven't noticed more than one.
Best Answer
Note: I'm not using the provided data from the question as the number of header fields does not seem to match up with the number of data fields. Instead I use
printf
to create a simple data set with the same quoting issue as described in the question.Using Miller (
mlr
) as shown below, you will be able to convert the problematic embedded double quotes into properly CSV-encoded embedded double quotes. This includes doubling up each embedded double-quote character:This would create a CSV document that any CSV-aware parser would be able to read correctly, preserving the embedded quotes.
To completely remove the embedded double quotes, you may use Miller like so:
This uses
mlr
to iterate over all fields in all records and to remove any double-quote character found.If a field needs quoting due to containing a comma, then Miller will quote it:
The Miller command again, but by itself:
If you know the name of the field that contains the quotes that you wish to remove, e.g.
line description
, then you may simplify the command and remove the loop: