Undo letterspacing with sed

perl, regular expression, sed, text processing

I have a source text file containing text where some words are l e t t e r s p a c e d like the word "letterspaced" in this question (i.e., there is a space character between the letters of the word).

How can I undo letterspacing using sed?

A pattern like \([A-Za-z] \)\+[A-Za-z] matches a letterspaced word, and s/ //g takes the spaces out, but how do I extract a letterspaced word from a line of text and undo the letterspacing without harming the legitimate space characters in the rest of the text?

Best Answer

You can do it like this:

sed     -e's/ \([^ ][^ ]\)/\n\1/g' \
        -e's/\([^ ][^ ]\) /\1\n/g' \
        -e's/ //g;y/\n/ /
'       <<\IN
I have a source text file containing text where
some words are l e t t e r s p a c e d
like the word "letterspaced" in this question
(i.e., there is a space character between the
letters of the word. 
IN

The idea is to first find every space that is either followed by or preceded by two or more non-space characters and set it aside as a newline character. Next, remove all remaining spaces. Finally, translate all newlines back to spaces.
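To make the "set aside" step visible, here is a minimal sketch of the first two substitutions with a "|" as the marker instead of a newline. The function name mark_protected_spaces and the "|" marker are assumptions for illustration only; a real script should keep the newline, since "|" could already occur in the text.

```shell
# Illustration only: mark the spaces that will be kept with "|"
# (the command in the answer uses a newline as the marker instead).
mark_protected_spaces() {
    sed -e 's/ \([^ ][^ ]\)/|\1/g' \
        -e 's/\([^ ][^ ]\) /\1|/g'
}

printf '%s\n' 'some words are l e t t e r s p a c e d' | mark_protected_spaces
# prints: some|words|are|l e t t e r s p a c e d
```

Every space still present after this step has at most one non-space character on each side, so it is treated as letterspacing and deleted; the markers are then turned back into spaces.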

This is not perfect: without incorporating a dictionary of every word you could possibly use, the best you will get is some kind of heuristic. This one's pretty good, though.
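Two inputs where the heuristic guesses wrong, as a rough sketch. The wrapper name unspace is hypothetical, and the \n in the replacement assumes a sed that accepts it there, such as GNU sed.

```shell
# Same three steps as the command above, wrapped in a hypothetical helper.
# Assumes GNU sed (or another sed that understands \n in the replacement).
unspace() {
    sed -e 's/ \([^ ][^ ]\)/\n\1/g' \
        -e 's/\([^ ][^ ]\) /\1\n/g' \
        -e 's/ //g' -e 'y/\n/ /'
}

# A one-letter word next to a letterspaced word is absorbed into it.
printf '%s\n' 'I am a b i g fan' | unspace
# prints: I am abig fan

# The boundary between two adjacent letterspaced words is lost.
printf '%s\n' 'l e t t e r s p a c e d w o r d s' | unspace
# prints: letterspacedwords
```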

Also, depending on the sed you use, you might have to use a literal newline (preceded by a backslash) in place of the \n I use in the replacement of the first two substitution statements.
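A sketch of that more portable form, with a backslash followed by a literal newline in each replacement. The function name undo_letterspacing is hypothetical. The \n in the y command is kept as is, since POSIX defines it there.

```shell
# Portable variant: backslash + literal newline in the replacement
# instead of \n. The function name is an assumption for illustration.
undo_letterspacing() {
    sed -e 's/ \([^ ][^ ]\)/\
\1/g' \
        -e 's/\([^ ][^ ]\) /\1\
/g' \
        -e 's/ //g' -e 'y/\n/ /'
}

printf '%s\n' 'some words are l e t t e r s p a c e d' | undo_letterspacing
# prints: some words are letterspaced
```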

Aside from that caveat, though, this will work, and work very fast, with any POSIX sed. It doesn't need any costly lookaheads or lookbehinds, because it just sets aside the spaces that cannot be letterspacing, which means each substitution can handle the whole pattern space in a single pass.

OUTPUT

I have a source text file containing text where
some words are letterspaced
like the word "letterspaced" in this question
(i.e., there is a space character between the
letters of the word.