I have a simple XML file containing some words hyphenated across page breaks. The input is something like
bla bla bla hyphe-</page>
<page>nated bla bla bla
and the output should look like
bla bla bla</page>
<page>hyphenated bla bla bla
I am aware of the sed command N, but I can't control whether my hyphenation occurs on odd or even lines.
Can I do the hyphenation removal as sketched above with sed? Are there alternative ways of doing it (e.g. with other UNIX shell commands, or with Python or Perl)?
EDIT. On request, a real example from my input files:
[...] and vapours, upon the comparison of the air-thermo-</page>
<page>meter with the mercurial thermometer, upon the elastic [...]
EDIT2: Although I picked the example rather randomly, it is indeed a very nasty one. The desired output in this case is
[...] and vapours, upon the comparison of the</page>
<page>air-thermometer with the mercurial thermometer, upon the elastic [...]
i.e. use the space as a word separator. The main problem for me is writing a pattern that spans the line break in the original. And yes, the pattern should only remove hyphens immediately preceding </page>.
Best Answer
This is something of a monster. With Perl it should be easier.
This is for GNU sed (some other sed implementations use the -E option for extended regular expressions).
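
Since the question also asks about Python, here is a minimal sketch of the same idea in that language. It is not the sed command from this answer, just an illustration. It assumes the page break is always written exactly as </page>, a newline, then <page>, and that a plain space precedes the split word, as in the examples above. The names join_hyphenated and PAGE_BREAK_HYPHEN are made up for this sketch.

```python
import re

# A space, the first half of the word, the hyphen, then the page break.
# Assumption: the break is exactly "</page>", one newline, "<page>".
PAGE_BREAK_HYPHEN = re.compile(r" ([^\s<>]+)-</page>\n<page>")


def join_hyphenated(text: str) -> str:
    """Move a word split across a page break to the start of the next page."""
    # The space before the word is dropped; the word half lands after <page>.
    return PAGE_BREAK_HYPHEN.sub(r"</page>\n<page>\1", text)


sample = (
    "[...] and vapours, upon the comparison of the air-thermo-</page>\n"
    "<page>meter with the mercurial thermometer, upon the elastic [...]"
)
print(join_hyphenated(sample))
```

Because the whole text is held in one string, the pattern can span the line break directly, so it does not matter whether the hyphen falls on an odd or even line. Only a hyphen immediately before </page> is touched; the hyphen inside air-thermometer stays. For a real file, the text would have to be read in one piece (for example with pathlib.Path.read_text) rather than line by line.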