Substitute every comma outside of double quotes for a pipe

csvregular expressionsed

Using sed, I'd like to substitute every comma that is outside of double quotes for a pipe.

So that this line in .csv file:

John,Tonny,"345.3435,23",56th Street

Would be converted to:

John|Tonny|"345.3435,23"|56th Street

Could you help me with the regex for that?

Best Answer

If your sed supports the -E option (-r in some implementations):

sed -Ee :1 -e 's/^(([^",]|"[^"]*")*),/\1|/;t1' < file

The

:label
   s/pattern/replacement/
t label

Is a very common sed idiom. It keeps doing the same substitution in a loop as long as it's successful.

Here, we're substituting the leading part of the line made of 0 or more quoted strings or characters other that " and , (captured in \1) followed by a , with that \1 capture and a |, so on your sample that means:

  • John,Tonny,"345.3435,23",56th Street -> John|Tonny,"345.3435,23",56th Street
  • John|Tonny,"345.3435,23",56th Street -> John|Tonny|"345.3435,23",56th Street
  • John|Tonny|"345.3435,23",56th Street -> John|Tonny|"345.3435,23"|56th Street
  • and we stop here as the pattern doesn't match any more on that.

With perl, you could do it with one substitution with the g flag with:

perl -pe 's{("[^"]*"|[^",]+)|,}{$1 // "|"}ge'

Here, assuming quotes are balanced in the input, the pattern would match all the input, breaking it up in either:

  • quoted string
  • sequences of characters other than , or "
  • a comma

And only when the matched string is a comma (when $1 is not defined in the replacement part), replace it with a |.

Related Question