sed Command – Ignore Leading Whitespace When Substituting Globally

I am trying to write a sed command that collapses excessive spaces in a file. There should be only one space between words, but leading spaces and tabs should be left alone. So the file:

     This is     an indented      paragraph. The   indentation   should not be changed.
This is the     second   line  of the    paragraph.

Will become:

     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

I have tried variations of

/^[ \t]*/!s/[ \t]+/ /g

Any ideas would be appreciated.

Best Answer

$ sed 's/\>[[:blank:]]\{1,\}/ /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

The expression I used matches one or several [[:blank:]] (spaces or tabs) after a word, and replaces these with a single space. The \> matches the zero-width boundary between a word-character and a non-word-character.

This was tested with OpenBSD's native sed, but I think it should work with GNU sed as well. GNU sed also uses \b for matching word boundaries.

You could also use sed -E to shorten this to

sed -E 's/\>[[:blank:]]+/ /g' file

Again, if \> doesn't work for you with GNU sed, use \b instead.

Note that although the above sorts out your example text in the correct way, it does not quite work for removing spaces after punctuation, as after the first sentence in

     This is     an indented      paragraph.        The   indentation   should not be changed.
This is the     second   line  of the    paragraph.

For that, a slightly more complicated variant would do the trick:

$ sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

This replaces any non-blank character followed by one or more blank characters with the non-blank character and a single space.
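
To see the difference between the word-boundary version and this one, here is a small sketch. The sample string is made up for illustration, and the script assumes a sed that accepts both -E and \> (GNU and OpenBSD sed do):

```shell
# Made-up sample: two leading spaces, then a full stop followed by four spaces.
input='  a.    b   c'

# Word-boundary version: the run after "." is left alone,
# because "." is not a word character, so \> cannot match right before the blanks.
word=$(printf '%s\n' "$input" | sed -E 's/\>[[:blank:]]+/ /g')

# Non-blank version: any run of blanks that follows a non-blank is squeezed.
any=$(printf '%s\n' "$input" | sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g')

# Brackets make the leading spaces visible.
printf '[%s]\n' "$word" "$any"
```

The first line printed still has four spaces after "a.", the second does not; the two leading spaces survive in both.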

Or, using standard sed (and a very tiny optimization in that it will only do the substitution if there are two or more spaces/tabs after the non-space/tab),

$ sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.
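
One side effect of the \{2,\} form is worth checking on your own data: a single tab between two words is a run of length one, so it is not converted to a space. A minimal sketch, using a made-up input line with a leading tab and mixed space/tab runs:

```shell
# Made-up sample: leading tab, three spaces, two tabs, then a single tab.
input=$(printf '\tfoo   bar\t\tbaz\tqux')

# \{1,\}: every run of blanks after a non-blank becomes one space.
one=$(printf '%s\n' "$input" | sed 's/\([^[:blank:]]\)[[:blank:]]\{1,\}/\1 /g')

# \{2,\}: runs of length one are skipped, so the lone tab before "qux" stays a tab.
two=$(printf '%s\n' "$input" | sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g')

# Show both results with tabs made visible as \t by od.
printf '%s\n' "$one" | od -c
printf '%s\n' "$two" | od -c
```

In both cases the leading tab is untouched, since no non-blank character precedes it. If single tabs between words should also become spaces, the \{1,\} form is the one to use.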

Related Solutions

Add leading zeros until all lines before the comma consist of nine characters and subsequently insert a character every three digits using sed

If your input doesn't have a long sequence of digits in the second field, try:

$ sed -e 's|^[^,]*|#000000000&|;s|#[^,]*\(.\{9\}\),|\1,|;s|\([0-9]\{3\}\)|\1/|g;s|/\([^0-9]\)|\1|;s|/$||' file
000/012/345,1s4c3v6s3nh6
123/456/789,9h5vgbdx34dc
000/000/012,7h4f45dcvbgh
001/234/567,09klijnmh563

Explanation

s|^[^,]*|#000000000&|: match everything from the start of the line up to the first ,, and replace it with a marker #, n zeros, and the matched text, where n is the width we want to pad to.
s|#[^,]*\(.\{9\}\),|\1,|: match everything from the marker to the first ,, keep only the last 9 characters before the ,, and discard the rest.
s|\([0-9]\{3\}\)|\1/|g: add a / after every run of 3 digits.
s|/\([^0-9]\)|\1|;s|/$||: if a / is followed by a non-digit, or is at the end of the line, remove it.
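
The steps above can be traced one at a time. This is only a sketch: the input line 12345,1s4c3v6s3nh6 is an assumption, inferred from the first line of the output shown earlier, and the variable names are made up.

```shell
# Assumed sample input (inferred from the output, not given in the original).
line='12345,1s4c3v6s3nh6'

# Step 1: prefix the first field with a marker and nine zeros.
s1=$(printf '%s\n' "$line" | sed 's|^[^,]*|#000000000&|')

# Step 2: keep only the last nine characters before the comma.
s2=$(printf '%s\n' "$s1" | sed 's|#[^,]*\(.\{9\}\),|\1,|')

# Step 3: put a / after every run of three digits.
s3=$(printf '%s\n' "$s2" | sed 's|\([0-9]\{3\}\)|\1/|g')

# Step 4: drop the / that ended up in front of the comma (or at end of line).
s4=$(printf '%s\n' "$s3" | sed 's|/\([^0-9]\)|\1|;s|/$||')

printf '%s\n' "$s1" "$s2" "$s3" "$s4"
```

Step 3 is applied to the whole line, which is why the second field must not contain three or more consecutive digits.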

or easier with perl:

$ perl -F',' -anle '
    $F[0] = sprintf "%09s", $F[0];
    $F[0] =~ s|.{3}|$&/|g;
    chop $F[0];
    print join ",",@F;
' file
000/012/345,1s4c3v6s3nh6
123/456/789,9h5vgbdx34dc
000/000/012,7h4f45dcvbgh
001/234/567,09klijnmh563

Undo letterspacing with sed

You can do it like this:

sed     -e's/ \([^ ][^ ]\)/\n\1/g' \
        -e's/\([^ ][^ ]\) /\1\n/g' \
        -e's/ //g;y/\n/ /
'       <<\IN
I have a source text file containing text where
some words are l e t t e r s p a c e d
like the word "letterspaced" in this question
(i.e., there is a space character between the
letters of the word. 
IN

The idea is to first find all spaces which are either preceded by or followed by two or more not-space characters and set them aside as newline characters. Next simply remove all remaining spaces. And last, translate all newlines back to spaces.
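
To make the three stages visible, here is a sketch that runs them separately on one line of the sample text. It swaps the newline for a visible @ as the placeholder, which is an assumption made purely for illustration: it only works if the input contains no @, whereas a newline can never occur inside a line sed has just read.

```shell
line='some words are l e t t e r s p a c e d'

# Stage 1a: a space followed by two non-spaces is a real word gap; mark it.
s1=$(printf '%s\n' "$line" | sed 's/ \([^ ][^ ]\)/@\1/g')

# Stage 1b: a space preceded by two non-spaces is a real word gap too.
s2=$(printf '%s\n' "$s1" | sed 's/\([^ ][^ ]\) /\1@/g')

# Stage 2: whatever spaces remain are letterspacing; delete them.
s3=$(printf '%s\n' "$s2" | sed 's/ //g')

# Stage 3: turn the placeholders back into spaces.
s4=$(printf '%s\n' "$s3" | sed 'y/@/ /')

printf '%s\n' "$s1" "$s2" "$s3" "$s4"
```

The space between "are" and "l" is only caught in stage 1b, because it is followed by a single letter but preceded by two.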

This is not perfect - without incorporating an entire dictionary of every word you could possibly use the best you will get is some kind of heuristic. This one's pretty good, though.

Also, depending on the sed you use, you might have to use a literal newline in place of the \n I use in the right-hand side of the first two substitution statements.

Aside from that caveat, though, this will work - and work very fast - with any POSIX sed. It doesn't need any costly lookaheads or lookbehinds, because it simply sets aside the spaces that cannot be letterspacing, which means each substitution can handle the whole pattern space in a single pass.

OUTPUT

I have a source text file containing text where
some words are letterspaced
like the word "letterspaced" in this question
(i.e., there is a space character between the
letters of the word.
