CSV Text Processing – How to Remove Commas Within Double-Quoted Numbers

csvsedtext processing

In a text file, I want to remove , (commas) and also the " (quotes) (only if the double quotes contains numbers separated by commas).

56,72,"12,34,54",x,y,"foo,a,b,bar"

Expected ouput

56,72,123454,x,y,"foo,a,b,bar"

Note: I show the above line just as an example. My text file contains many lines like above and the numbers separated by commas present within the double quotes should vary. That is,

56,72,"12,34,54",x,y,"foo,a,b,bar"
56,92,"12,34",x,y,"foo,a,b,bar"
56,72,"12,34,54,78,76,54,67",x,y,"foo,a,b,bar"
56,72,x,y,"foo,a,b,bar","12,34,54"
56,72,x,y,"foo,a,b,bar","12,34,54","45,57,84,92","bar,foo"

Expected output:

56,72,123454,x,y,"foo,a,b,bar"
56,92,1234,x,y,"foo,a,b,bar"
56,72,12345478765467,x,y,"foo,a,b,bar"
56,72,x,y,"foo,a,b,bar",123454
56,72,x,y,"foo,a,b,bar",123454,45578492,"bar,foo"

There a n number of numbers present within the double quotes separated by commas. And also leave the double quotes which contains characters as it is.

I love sed text processing tool. I'm happy if you post any sed solution for this.

Best Answer

This (adapted from here) should do what you need though @rici's Perl one is much simpler:

$ sed -r ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;ta; s/""/","/g; 
          s/"([0-9]*)",?/\1,/g ' file
56,72,123454,x,y,"foo,a,b,bar"
56,92,1234,x,y,"foo,a,b,bar"
56,72,12345478765467,x,y,"foo,a,b,bar"
56,72,x,y,"foo,a,b,bar",123454,
56,72,x,y,"foo,a,b,bar",123454,45578492,"bar,foo"

Explanation

  • :a : define a label called a.
  • s/(("[0-9,]*",?)*"[0-9,]*),/\1/ : This one needs to be broken down
    • First of all, using this construct : (foo(bar)), \1 will be foobar and \2 will be bar.
    • "[0-9,]*",? : match 0 or more of 0-9 or ,, followed by 0 or 1 ,.
    • ("[0-9,]*",?)* : match 0 or more of the above.
    • "[0-9,]* : match 0 or more of 0-9 or , that come right after a "
  • ta; : go back to the label a and run again if the substitution was successful.
  • s/""/","/g; : post-processing. Replace "" with ",".
  • s/"([0-9]*)",?/\1,/g : remove all quotes around numbers.

This might be easier to understand with another example:

$ echo '"1,2,3,4"' | sed -nr ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;p;ta;'
"1,2,34"
"1,234"
"1234"
"1234"

So, while you can find a number that is right after a quote and followed by a comma and another number, join the two numbers together and repeat the process until it is no longer possible.

At this point I believe it is useful to mention a quote from info sed that appears in the section describing advanced functions such as the label used above (thanks for finding if @Braiam):

In most cases, use of these commands indicates that you are probably better off programming in something like `awk' or Perl.

Related Question