I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined in some fonts, got read in as capital W's. Now I need to replace those W's with 'fi'. They can easily be distinguished by the fact that a capital W never occurs in the middle of a word in true English. So I need a sed one-liner that replaces all word-medial capital W's with the letters 'fi'.
Sed one-liner to replace word-medial capitals
ocr sed
Best Answer
A capital W doesn't occur at the end of a word either, but it may occur in an all-caps abbreviation. So I'd replace W when it's immediately after a lowercase letter, or when it follows an uppercase letter and precedes a lowercase letter (AWre).

This doesn't cover 'fifi', which would be scanned as WW (my biggest word list only finds it in “fifing”). More importantly, this doesn't cover W at the beginning of a word; you can capture some cases by looking at the second letter, but that's still going to miss many words that begin with 'fi'. In English, many letters never appear after a W, so a word-initial W followed by one of them can safely be read as 'fi'.

For more precise results and to cope with other languages, you can switch to a more complex dictionary-based approach (which fancy OCR systems often use; evidently yours isn't fancy enough).
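A minimal sketch of the sed rules described above, under a few assumptions of mine. The function names fix_medial_w and fix_initial_w are made up for illustration. The list of "safe" second letters (everything except a, e, h, i, o, r, u, y) is my own guess at which letters never follow a word-initial W in English, so check it against your text. LC_ALL=C is there so that [a-z] and [A-Z] mean plain ASCII ranges regardless of locale.

```shell
# Word-medial W: right after a lowercase letter, or between an
# uppercase and a lowercase letter. All-caps words like "NEW" are left alone.
fix_medial_w() {
  LC_ALL=C sed -e 's/\([a-z]\)W/\1fi/g' \
               -e 's/\([A-Z]\)W\([a-z]\)/\1fi\2/g'
}

# Word-initial W: only when the next letter is one that (by assumption)
# never follows W at the start of an English word.
fix_initial_w() {
  LC_ALL=C sed -E 's/(^|[^[:alpha:]])W([bcdfgjklmnpqstvwxz])/\1fi\2/g'
}

printf 'We deWne the Wle; NEW YORK is Wne.\n' | fix_medial_w | fix_initial_w
# prints: We define the file; NEW YORK is fine.
```

To run it on a file, something like `fix_medial_w < scan.txt | fix_initial_w > fixed.txt` should do (scan.txt being a placeholder name). Note that the word-initial rule always inserts a lowercase 'fi', so a sentence that starts with "Finally" comes out as "finally", and words such as "first" or "field" are still missed because "Wr" and "We" also begin real words.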