sed Command – Ignore Leading Whitespace When Substituting Globally

I am trying to write a sed command that collapses excess spaces in a file. There should be exactly one space between words, but leading spaces and tabs should be left alone. So the file:

     This is     an indented      paragraph. The   indentation   should not be changed.
This is the     second   line  of the    paragraph. 

Will become:

     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

I have tried variations of

/^[ \t]*/!s/[ \t]+/ /g

Any ideas would be appreciated.

Best Answer

$ sed 's/\>[[:blank:]]\{1,\}/ /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

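The expression I used matches one or more [[:blank:]] characters (spaces or tabs) that follow a word, and replaces them with a single space. The \> matches the zero-width boundary at the end of a word, between a word character and a following non-word character. Leading whitespace is not preceded by a word, so it is left untouched.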

This was tested with OpenBSD's native sed, but I think it should work with GNU sed as well. GNU sed also supports \b, which matches a word boundary at either end of a word.

With sed -E (extended regular expressions), this can be shortened to

sed -E 's/\>[[:blank:]]+/ /g' file

Again, if \> doesn't work for you with GNU sed, use \b instead.
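
A quick way to check that leading tabs, and not only leading spaces, survive the substitution is to feed sed a small made-up sample and make the whitespace visible. This is only a sketch: the function name squeeze_after_words and the sample text are invented for illustration, and it assumes your sed supports \> as described above. The final sed -n l prints each line with tabs shown as \t and the line end marked with $.

```shell
# Hypothetical helper wrapping the word-boundary command from this answer.
squeeze_after_words() {
    sed 's/\>[[:blank:]]\{1,\}/ /g'
}

# Sample input (made up): one tab-indented line, one space-indented line.
printf '\tfoo   bar\t\tbaz\n   one  two\n' | squeeze_after_words | sed -n l
# \tfoo bar baz$
#    one two$
```

The leading tab and the three leading spaces are unchanged, while the runs of blanks between words are each reduced to one space.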


Note that although the above handles your example text correctly, it does not remove extra spaces after punctuation, because punctuation characters are not word characters and \> does not match after them. An example is the run of spaces after the first sentence in

     This is     an indented      paragraph.        The   indentation   should not be changed.
This is the     second   line  of the    paragraph.

For that, a slightly more complicated variant would do the trick:

$ sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

This replaces any non-blank character followed by one or more blank characters with the non-blank character and a single space.
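
The difference between the two approaches is easiest to see on a short line containing a sentence break. The following is a minimal sketch, assuming a sed that accepts both -E and \> like the commands above; the sample string is made up for illustration.

```shell
# Hypothetical input: two leading spaces, four spaces after the period.
sample='  Done.    Next   step'

# Word-boundary variant: "." is not a word character, so the blanks after it stay.
printf '%s\n' "$sample" | sed -E 's/\>[[:blank:]]+/ /g'
#   Done.    Next step

# Non-blank variant: any non-blank character, including ".", can precede the run.
printf '%s\n' "$sample" | sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g'
#   Done. Next step
```

In both cases the two leading spaces are kept, because nothing precedes them on the line for the expression to match against.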

Or, using standard sed syntax (basic regular expressions), with the very tiny optimization that the substitution is only done if two or more spaces or tabs follow the non-blank character:

$ sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.
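
One side effect of \{2,\} is worth knowing: a single tab between two words is left as a tab instead of being turned into a space, since a run of length one no longer matches. The sketch below uses made-up input to show this next to the \{1,\} form; sed -n l is used only to make the tab visible as \t.

```shell
# Hypothetical input: a lone tab between "a" and "b", two spaces between "b" and "c".
printf 'a\tb  c\n' | sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g' | sed -n l
# a\tb c$

# With \{1,\} the lone tab is replaced by a space as well.
printf 'a\tb  c\n' | sed 's/\([^[:blank:]]\)[[:blank:]]\{1,\}/\1 /g' | sed -n l
# a b c$
```

If the input may contain single tabs between words and you want those normalised too, the \{1,\} form is the safer choice.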