Sed one-liner to replace word-medial capitals

ocr sed

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined into a ligature in some fonts, got read in as capital W's. Now I need to replace those W's with 'fi'. They can easily be distinguished by the fact that a capital W never occurs in the middle of a word in real English. So I need a sed one-liner that replaces all word-medial capital W's with the letters fi.

Best Answer

A capital W doesn't occur at the end of a word either, but it may occur in an all-caps abbreviation. So I'd replace W when it comes immediately after a lowercase letter, or when it follows an uppercase letter and precedes a lowercase letter (AWre, i.e. "Afire").

sed -e 's/\([[:lower:]]\)W/\1fi/g' -e 's/\([[:alpha:]]\)W\([[:lower:]]\)/\1fi\2/g'
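As a quick check of what these two expressions do and don't touch, here is a small sketch. The function name fix_medial_w and the sample sentence are made up for illustration; the sed expressions are the ones above.

```shell
# Hypothetical wrapper around the two expressions from the answer.
fix_medial_w() {
  sed -e 's/\([[:lower:]]\)W/\1fi/g' \
      -e 's/\([[:alpha:]]\)W\([[:lower:]]\)/\1fi\2/g'
}

# "ofWce" has a W after a lowercase letter, so it is fixed.
# "Wrst" starts with the W and "WWF" is all caps, so both are left alone.
printf '%s\n' 'The ofWce was on the Wrst floor of WWF.' | fix_medial_w
# prints: The office was on the Wrst floor of WWF.
```

The untouched "Wrst" shows the word-initial gap discussed next.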

This doesn't cover fifi (which my biggest word list finds only in “fifing”). More importantly, it doesn't cover W at the beginning of a word. You can catch some cases by looking at the second letter, but that will still miss many words that begin with fi. In English, many letters never follow a word-initial W, so a W in front of one of them is almost certainly a misread fi:

… -e 's/\([^[:alnum:]]\)W\([b-dfgj-npqstv-xz]\)/\1fi\2/g' \
  -e 's/^W\([b-dfgj-npqstv-xz]\)/fi\1/'
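Put together, the whole thing might look like the sketch below. The function name fix_ocr_w and the sample line are assumptions for illustration. Note that the line-start expression has only one group, so its replacement refers to \1.

```shell
# Hypothetical wrapper: the two word-medial expressions plus the
# two word-initial ones (after a non-alphanumeric, and at line start).
fix_ocr_w() {
  sed -e 's/\([[:lower:]]\)W/\1fi/g' \
      -e 's/\([[:alpha:]]\)W\([[:lower:]]\)/\1fi\2/g' \
      -e 's/\([^[:alnum:]]\)W\([b-dfgj-npqstv-xz]\)/\1fi\2/g' \
      -e 's/^W\([b-dfgj-npqstv-xz]\)/fi\1/'
}

# "Wales" survives because "a" can follow a real W.
printf '%s\n' 'Wnally, the Wlm about Wales was conWrmed.' | fix_ocr_w
# prints: finally, the film about Wales was confirmed.
```

Words such as "Wrst" (first) are still missed, since "r" can legitimately follow a W (as in "Wright").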

For more precise results, and to cope with other languages, you can switch to a more complex dictionary-based approach (fancy OCR systems often use one; evidently yours isn't fancy enough).
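A rough sketch of what a dictionary-based pass could look like, using awk rather than sed. Everything here is an assumption for illustration: the function name fix_with_dict, the word-list argument (one word per line, for example /usr/share/dict/words if your system has it), and the rule itself, which replaces every W in a word at once and accepts the result only if it is in the list while the original is not.

```shell
# Hypothetical helper: fix_with_dict WORDLIST < input
fix_with_dict() {
  awk -v dict="$1" '
    BEGIN {
      # Load the word list into a lookup table, lowercased.
      while ((getline w < dict) > 0) known[tolower(w)] = 1
      close(dict)
    }
    {
      out = ""
      rest = $0
      # Walk the line one run of ASCII letters at a time.
      while (match(rest, /[A-Za-z]+/)) {
        word = substr(rest, RSTART, RLENGTH)
        out = out substr(rest, 1, RSTART - 1)
        rest = substr(rest, RSTART + RLENGTH)
        cand = word
        gsub(/W/, "fi", cand)
        # Keep the substitution only if it turns an unknown word into a known one.
        if (cand != word && !(tolower(word) in known) && (tolower(cand) in known))
          word = cand
        out = out word
      }
      print out rest
    }
  '
}

# Usage, assuming a word list exists at this path:
# fix_with_dict /usr/share/dict/words < scan.txt > fixed.txt
```

Unlike the pattern-based expressions, this catches "Wrst" when "first" is in the list, and leaves "Wales" and "WWF" alone because "fiales" and "fifif" are not. It only handles ASCII letters and won't fix a word that mixes a genuine W with a misread one.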
