Remove hyphenation with sed

perl, sed, text processing

I have a simple XML file containing some words that are hyphenated across page breaks. The input is something like

 ba bla bla hyphe-</page>
 <page>nated bla bla bla

and the output should look like

 bla bla bla</page>
 <page>hyphenated bla bla bla

I am aware of the sed command N, but I can't control whether the hyphenation occurs on odd or even lines.

Can I do the hyphenation removal as sketched above with sed? Are there alternative ways of doing it (e.g. with other UNIX shell commands, or with python or perl)?

EDIT. On request, a real example from my input files:

[...] and vapours, upon the comparison of the air-thermo-</page>
<page>meter with the mercurial thermometer, upon the elastic [...]

EDIT2: Although I picked the example more or less at random, it is indeed a very nasty one. The wanted output in this case is

 [...] and vapours, upon the comparison of the</page>
<page>air-thermometer with the mercurial thermometer, upon the elastic [...]

i.e. use the space as the word separator. The main problem for me is to write a pattern that spans the line break in the original. And yes, the pattern should only remove hyphens immediately preceding </page>

Best Answer

This turned out to be a bit of a monster :) It would probably be easier with perl.

cat file
ba bla bla hyphe-</page>
<page>nated bla bla bla
and the output should look like

bla bla bla</page>
<page>hyphenated bla bla bla

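This is for GNU sed, where -r enables extended regular expressions (some other seds use the -E option for that). When a line ends in a hyphenated word fragment followed by a closing tag, N appends the next line, and the substitution moves the fragment in front of the rest of the word: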

sed -nr '/[[:alpha:]]+-<\/[[:alpha:]]+>$/{
N
s!([[:alpha:]]+)-(</[[:alpha:]]+>)\n(<[[:alpha:]]+>)([[:alpha:]]+)!\2\n\3\1\4!}
p' file
ba bla bla </page>
<page>hyphenated bla bla bla
and the output should look like

bla bla bla</page>
<page>hyphenated bla bla bla
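
Since the question also asks about python: here is a minimal sketch of the same idea, not a definitive implementation. It assumes the tags are literally </page> and <page>, that the whole file fits in memory, and that a word is any run of characters without whitespace or angle brackets (so "air-thermo-" is moved as a whole, whereas [[:alpha:]]+ in the sed version would only pick up "thermo"). The names HYPHEN_BREAK and dehyphenate are made up for this example.

```python
import re

# A word fragment (no whitespace, no angle brackets) ending in "-" right
# before </page>, then a line break, then <page> and the rest of the word.
# Leading blanks are consumed so no trailing space is left before </page>.
HYPHEN_BREAK = re.compile(
    r"[ \t]*([^\s<>]+)-(</page>)(\r?\n)(<page>)([^\s<>]+)"
)


def dehyphenate(text):
    """Move each fragment hyphenated before </page> onto the next page."""
    # \2\3\4 keep the tags and the line break; \1\5 is the rejoined word.
    return HYPHEN_BREAK.sub(r"\2\3\4\1\5", text)


if __name__ == "__main__":
    sample = "comparison of the air-thermo-</page>\n<page>meter with the mercurial"
    print(dehyphenate(sample))
    # comparison of the</page>
    # <page>air-thermometer with the mercurial
```

The pattern can only span the line break if the text is passed in as one string, so read the file in one go (for example with open(...).read()) rather than line by line. Hyphens that are not immediately followed by </page> are left alone.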