String replacement in file

sed, text processing

I have the following file:

<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.net-->
  <string name="test" >- test</string>
  <string name="test" >test-test</string>
  <string name="test" >test - test</string>

I would like to replace the en dash with its Unicode value (&#8211;), but not every occurrence: only the ones inside the string tags.

I ran several sed commands with different regexes, but I couldn't figure it out. One of them was:

sed -i.bak "s/-[^-\<\>0-9]/\&#8211\;/g" strings.xml

The output was:

<?xml version="1.0" encoding="utf-8"?>
<!-&#8211;enerated by-->
  <string name="test" >&#8211;test</string>
  <string name="test2" >test&#8211;est</string>
  <string name="test3" >test &#8211;test</string>

My problem is that it also replaces spaces and the first character of the second word. I don't have much experience with regex and sed. Could you please explain what I am doing wrong?

Note: I'm using OSX.

Best Answer

With a recent (for \K and s///r) perl and assuming your <string> tags don't nest:

perl -0777 -pi.bak -e's{<string.*?>\K.*?(?=</string>)}{$&=~s/-/&#8211;/rg}ges' file.xml
  • -0777: slurp mode: handle the whole file at once (to allow <string> tags to span several lines).
  • -p: sed mode
  • -i.bak: in-place editing with .bak extension (BTW, that's where some sed implementations got that idea from)
  • s{...}{...}ges: substitute globally (g), where . matches newline characters as well (s), and treat the replacement as perl code to execute (e).
  • <string.*?>\K.*?(?=</string>): match from <string...> to </string> but don't include the tags themselves in the part that is matched (\K defines where the matched portion starts, and (?=...) is a look-ahead operator that only checks that </string> is there, but doesn't include it in the match).
  • $&=~s/.../.../rg: do the substitution on the matched part ($&). The r flag makes the substitution return the substituted string instead of modifying $&.