String replacement in file

sed, text-processing

I have the following file:

<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.net-->
  <string name="test" >- test</string>
  <string name="test" >test-test</string>
  <string name="test" >test - test</string>

and I would like to replace the en dash with its Unicode value (&#8211;), but only inside the string tags, not everywhere else in the file.

I ran several sed commands with different regexes, but I couldn't figure it out. One of them was:

sed -i.bak "s/-[^-\<\>0-9]/\&#8211\;/g" strings.xml

the output was:

<?xml version="1.0" encoding="utf-8"?>
<!-&#8211;enerated by-->
  <string name="test" >&#8211;test</string>
  <string name="test2" >test&#8211;est</string>
  <string name="test3" >test &#8211;test</string>

My problem is that it also replaces the character after the dash (a space, or the first character of the following word). I don't have much experience with regex and sed. Could you please explain what I am doing wrong?

Note: I'm using OSX.

Best Answer

With a recent (for \K and s///r) perl and assuming your <string> tags don't nest:

perl -0777 -pi.bak -e's{<string.*?>\K.*?(?=</string>)}{$&=~s/-/&#8211;/rg}ges' file.xml

-0777: slurp mode: handle the whole file at once (to allow <string> tags to span several lines).
-p: sed mode (run the code on each record and print the result)
-i.bak: in-place editing with .bak extension (BTW, that's where some sed implementations got that idea from)
s{...}{...}ges: substitute globally (g), where . matches newline characters as well (s), and treat the replacement as perl code to execute (e).
<string.*?>\K.*?(?=</string>): match from <string...> to </string> but don't include the tags themselves in the part that is matched (\K defines where the matched portion starts, and (?=...) is a look-ahead operator that only checks if </string> is there, but doesn't include it in the match).
$&=~s/.../.../rg: do the substitution on the matched part ($&). The r flag returns the substituted string instead of modifying $& in place.

Related Solutions

Non-line-oriented tool for string replacement

The first thing that occurs to me when facing this type of problem is to change the record separator. In most tools, this is set to \n by default but that can be changed. For example:

Perl
```
perl -0x3E -pe 's/<foobar>/\n$&/' file
```
Explanation
- -0 : this sets the input record separator to a character given its hexadecimal value. In this case, I am setting it to > whose hex value is 3E. The general format is -0xHEX_VALUE. This is just a trick to break the line into manageable chunks.
- -pe : print each input line after applying the script given by -e.
- s/<foobar>/\n$&/ : a simple substitution. The $& is whatever was matched, in this case <foobar>.
awk
```
awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" file
```
Explanation
- RS="<" : set the input record separator to <.
- gsub(/foobar>/,"\n<foobar>") : substitute all cases of foobar> with \n<foobar>. Note that because RS has been set to <, all < are removed from the input file (that's how awk works) so we need to match foobar> (without a <) and replace with \n<foobar>.
- printf "%s",$0 : print the current "line" after the substitution. $0 is the current record in awk so it will hold whatever was before the <.
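
The same idea can be tried on a tiny input. This is only a sketch: the input string is made up, and a - operand is added so that awk reads stdin after processing the RS assignment.

```shell
# Single-line input with no newline at all (illustrative data).
input='blah blah <foobar>blah blah<foobar>end'

# RS="<" splits the input at every "<"; the separator itself is dropped,
# so the replacement text has to put the "<" back.
out=$(printf '%s' "$input" |
  awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" -)
printf '%s\n' "$out"
```

Because every < is consumed as a separator, any < that is not followed by foobar> would be lost from the output. The command as written therefore suits input where <foobar> is the only tag.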

I tested these on a 2.3 GB, single-line file created with these commands:

for i in {1..900000}; do printf "blah blah <foobar>blah blah"; done > file
for i in {1..100}; do cat file >> file1; done
mv file1 file

Both the awk and the perl used negligible amounts of memory.

Text Processing Sed Awk Gawk – String Replacement Using a Dictionary

Here's one way with sed:

sed '
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' dictionary.txt | sed -f - novel.txt

How it works:
The 1st sed turns dictionary.txt into a script-file (editing commands, one per line). This is piped to the 2nd sed (note the -f - which means read commands from stdin) that executes those commands, editing novel.txt.
This requires translating your format

"STRING"   :   "REPLACEMENT"

into a sed command and escaping any special characters in the process for both LHS and RHS:

s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g

So the first substitution

s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|

turns "STRING" : "REPLACEMENT" into STRING\nREPLACEMENT (\n is a newline char). The result is then copied into the hold space (h).
s|.*\n|| deletes the first part keeping only REPLACEMENT then s|[\&/]|\\&|g escapes the reserved characters (this is the RHS).
It then exchanges the hold buffer with the pattern space and s|\n.*|| deletes the second part keeping only STRING and s|[[\.*^$/]|\\&|g does the escaping (this is the LHS).
The content of the hold buffer is then appended to pattern space via G so now the pattern space content is ESCAPED_STRING\nESCAPED_REPLACEMENT.
The final substitution

s|\(.*\)\n\(.*\)|s/\1/\2/g|

transforms it into s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g
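
The two stages can be sketched on a toy dictionary. The file contents, the temporary directory and the variable names below are assumptions for illustration. The generated script is written to a file rather than piped through -f -, purely so the intermediate script can be inspected.

```shell
tmp=$(mktemp -d)

# Illustrative dictionary: the second entry has a regex metacharacter (.)
# on the left and a slash on the right, to exercise both escaping steps.
cat > "$tmp/dictionary.txt" <<'EOF'
"fast"   :   "quick"
"1.5" : "3/2"
EOF
printf '%s\n' 'The fast car is 1.5 m wide, not 105 m.' > "$tmp/novel.txt"

# Stage 1: turn each dictionary line into an escaped s/LHS/RHS/g command.
sed '
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' "$tmp/dictionary.txt" > "$tmp/script.sed"
script=$(cat "$tmp/script.sed")

# Stage 2: run the generated commands against the text.
result=$(sed -f "$tmp/script.sed" "$tmp/novel.txt")
printf '%s\n' "$script" "$result"

rm -rf "$tmp"
```

The generated script is s/fast/quick/g followed by s/1\.5/3\/2/g. Since the dot is escaped, 105 in the text is left untouched, while 1.5 becomes 3/2.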
