How to remove duplicate letters using sed

sed, text-processing

Using sed, how can I remove duplicate letters from HEADERS within a text file?

NNAAMMEE
       nice - run a program with modified scheduling priority

SSYYNNOOPPSSIISS
       nice     [-n    adjustment]    [-adjustment]    [--adjustment=adjustment] [command [a$

Above is an example. I want the output after processing with sed to be:

NAME
       nice - run a program with modified scheduling priority

SYNOPSIS
       nice     [-n    adjustment]    [-adjustment]    [--adjustment=adjustment] [command [a$

Best Answer

Method #1

You can use this sed command to do it:

$ sed 's/\([A-Za-z]\)\1\+/\1/g' file.txt

Example

Using your above sample input I created a file, sample.txt.

$ sed 's/\([A-Za-z]\)\1\+/\1/g' sample.txt 
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [command [a$
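
One caveat is worth sketching here. The \+ operator is a GNU sed extension, and because it matches a run of any length it also collapses letters that are legitimately doubled in the real header. The snippet below is a minimal illustration and not part of the original answer: collapse_runs is a hypothetical helper name, and \1\1* is the POSIX spelling of \1\+.

```shell
# collapse_runs is a hypothetical helper (not from the original answer).
# \1\1* is the POSIX basic-regex spelling of GNU's \1\+ : one or more
# further copies of the saved letter. LC_ALL=C keeps [A-Za-z] to ASCII.
collapse_runs() {
    LC_ALL=C sed 's/\([A-Za-z]\)\1\1*/\1/g'
}

printf 'NNAAMMEE\n' | collapse_runs          # prints: NAME
printf 'SSEEEE AALLSSOO\n' | collapse_runs   # prints: SE ALSO
```

The second call shows the limitation: a doubled header such as SSEEEE AALLSSOO (SEE ALSO) loses a genuine double letter, because the whole run EEEE collapses to a single E.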

Method #2

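There is also this method, which collapses every pair of identical adjacent characters, not just letters (note how the runs of spaces and the leading -- are shortened in the example output):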

$ sed 's/\(.\)\1/\1/g' file.txt 

Example

$ sed 's/\(.\)\1/\1/g' sample.txt 
NAME
    nice - run a program with modified scheduling priority

    SYNOPSIS
       nice   [-n  adjustment]  [-adjustment] [-adjustment=adjustment] [command [a$
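
To make the trade-off concrete, here is a small sketch of Method #2 on two made-up inputs. halve_pairs is a hypothetical wrapper name, and the input strings are assumptions chosen for illustration. Because the pattern matches exact pairs, a header with a genuine double letter survives, but spaces and punctuation are affected too.

```shell
# halve_pairs is a hypothetical name for Method #2 wrapped as a function.
# \(.\)\1 matches ANY character followed by a copy of itself, so each
# pair is reduced to one character, whatever that character is.
halve_pairs() {
    sed 's/\(.\)\1/\1/g'
}

printf '%s\n' 'SSEEEE AALLSSOO' | halve_pairs       # prints: SEE ALSO
printf '%s\n' '--adjustment    x' | halve_pairs     # prints: -adjustment  x
```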

Method #3 (just the upper case)

The OP asked whether this could be modified so that only duplicated upper case letters are collapsed. Here's how, using a modified method #1 that restricts the character set to A-Z.

Example

$ sed 's/\([A-Z]\)\1\+/\1/g' sample.txt 
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [command [a$
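
Since the question asks about headers only, one possible refinement is to restrict the substitution to header lines with a sed address. The sketch below is not from the original answer. It assumes a header is any line that contains no lowercase letters, and dedupe_headers is a hypothetical helper name. It matches exact pairs rather than runs, so a header like SSEEEE AALLSSOO keeps its double E.

```shell
# dedupe_headers is a hypothetical helper (not from the original answer).
# Assumption: a header line contains no lowercase letters. The address
# /[a-z]/! selects only such lines; every other line passes through as-is.
dedupe_headers() {
    LC_ALL=C sed '/[a-z]/!s/\([A-Z]\)\1/\1/g'
}

printf '%s\n' '       SSYYNNOOPPSSIISS' '       indented AA line' | dedupe_headers
# prints:
#        SYNOPSIS
#        indented AA line
```

Body text that happens to contain a doubled capital is left alone, because the line it sits on also contains lowercase letters.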

Details of the above methods

All of the examples rely on the same technique: when a character in the target set is first matched (A-Z or a-z in method #1, any character in method #2), its value is saved. Wrapping escaped parentheses, \( and \), around part of a pattern tells sed to save whatever that part matched. The saved text is then available as a back-reference, both later in the same pattern and in the replacement. Back-references are named \1, \2, and so on up to \9, numbered by the order of the opening parentheses.

So the trick is to first match and save a single letter:

\([A-Za-z]\)

Then we turn around and use the value that we just saved as a secondary character that must occur right after the first one above, hence:

\([A-Za-z]\)\1

We're also using sed's search-and-replace command, s/pattern/replacement/g. The g flag applies the substitution to every match on a line rather than only the first.

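So whenever a character is immediately followed by a copy of itself, the whole match is replaced with a single \1, leaving just one of that character. In methods #1 and #3 the \+ after \1 (a GNU sed extension meaning "one or more") extends the match to a run of any length.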
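
The mechanics described above can be checked on a short made-up string. This is a minimal sketch with inputs chosen purely for illustration. It shows the effect of the g flag and that \1 only matches an exact repeat of the saved character.

```shell
# Without g, only the first doubled character on the line is collapsed.
printf 'aabbc\n' | sed 's/\(.\)\1/\1/'     # prints: abbc

# With g, every non-overlapping pair on the line is collapsed.
printf 'aabbc\n' | sed 's/\(.\)\1/\1/g'    # prints: abc

# \1 must repeat the saved character exactly, so "ab" is never a match.
printf 'abab\n' | sed 's/\(.\)\1/\1/g'     # prints: abab
```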
