How to remove duplicate letters using sed

Using sed, how can I remove duplicate letters from HEADERS within a text file?

NNAAMMEE
       nice - run a program with modified scheduling priority

SSYYNNOOPPSSIISS
       nice     [-n    adjustment]    [-adjustment]    [--adjustment=adjustment] [command [a$

Above is an example. After processing it with sed, I want the output to be:

NAME
       nice - run a program with modified scheduling priority

SYNOPSIS
       nice     [-n    adjustment]    [-adjustment]    [--adjustment=adjustment] [command [a$

Best Answer

Method #1

You can use this sed command to do it (\+ is a GNU extension, so this assumes GNU sed):

$ sed 's/\([A-Za-z]\)\1\+/\1/g' file.txt

Example

Using your sample input above, I created a file, sample.txt. Because the pattern also matches lower-case letters, the doubled m in "command" is collapsed as well.

$ sed 's/\([A-Za-z]\)\1\+/\1/g' sample.txt
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [comand [a$

Method #2

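This variant collapses any doubled character, not only letters, so it also halves runs of spaces and turns -- into -. It works on pairs, so a run of three identical characters is reduced to two rather than one: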

$ sed 's/\(.\)\1/\1/g' file.txt

Example

$ sed 's/\(.\)\1/\1/g' sample.txt 
NAME
    nice - run a program with modified scheduling priority

    SYNOPSIS
       nice   [-n  adjustment]  [-adjustment] [-adjustment=adjustment] [comand [a$

Method #3 (just the upper case)

The OP asked whether this could be modified so that only doubled upper-case letters are collapsed. Here's how, using a modified version of Method #1.

Example

$ sed 's/\([A-Z]\)\1\+/\1/g' sample.txt 
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [command [a$
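
One caveat with the \1\+ form: it collapses a whole run down to a single letter, so a doubled header such as SSEEEE AALLSSOO (SEE ALSO) would come out as SE ALSO. A minimal sketch of one way around this, assuming every header letter is doubled exactly once, is to collapse pairs instead. This form also avoids \+, which is a GNU extension. The function name dedupe_pairs and the sample lines are made up for illustration.

```shell
# Hypothetical helper: collapse each doubled upper-case letter to one copy.
# Pairs are consumed left to right, so "EEEE" becomes "EE", not "E".
dedupe_pairs() {
    sed 's/\([[:upper:]]\)\1/\1/g' "$@"
}

# Made-up sample: one doubled header and one ordinary body line.
printf '%s\n' 'SSEEEE AALLSSOO' '       see also: command' | dedupe_pairs
```

This prints SEE ALSO followed by the body line unchanged, since lower-case letters are outside the bracket expression.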

Details of the above methods

All the examples rely on the same technique: when sed first matches a character from the set (A-Z or a-z in Method #1), it saves that character's value. Wrapping part of a pattern in escaped parentheses, \( and \), tells sed to save whatever that part matched. The saved value can then be referenced later in the pattern or in the replacement. These references are named \1, \2, and so on up to \9, numbered by the order of the opening parentheses.

So the trick we're using is we match the first letter.

\([A-Za-z]\)

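Then we reuse the value we just saved, requiring the same character to occur again immediately after the first one:

\([A-Za-z]\)\1

In Methods #1 and #3 the trailing \+ lets that repeat occur one or more times, so a whole run of the same letter is matched at once.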

We're also using sed's search-and-replace command, s/../../g. The g flag makes the substitution global, so it applies to every match on a line rather than only the first.

So whenever a character is immediately followed by a copy of itself, the match is replaced with a single instance of that character.
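
To see the pieces work one at a time, here is a small sketch that feeds a made-up string through three variants of the substitution: without g, with g, and with a repeat on the backreference. It uses \{1,\}, the POSIX spelling of GNU's \+, so it should not depend on GNU sed; the input string is an assumption chosen only to show the differences.

```shell
# Made-up input: a doubled pair, a triple run, and letters that are not adjacent.
input='aabbbcac'

# Without g, only the first match on the line is replaced.
printf '%s\n' "$input" | sed 's/\([a-z]\)\1/\1/'          # abbbcac

# With g, every non-overlapping pair is replaced; the triple keeps two letters.
printf '%s\n' "$input" | sed 's/\([a-z]\)\1/\1/g'         # abbcac

# A repeat on the backreference swallows the whole run.
printf '%s\n' "$input" | sed 's/\([a-z]\)\1\{1,\}/\1/g'   # abcac
```

The final "cac" is untouched in all three cases because the two c characters are not adjacent, which is exactly what the backreference requires.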

Related Solutions

Using sed/awk to remove anything after first space

Sed

sed 's/\s.*$//'

Grep

grep -o '^\S*'

Awk

awk '{print $1}'

As pointed out in the comments, -o isn't POSIX; however, both GNU and BSD grep support it, so it should work for most people.

Also, \s and \S may not be available on all systems. If yours doesn't recognize them, you can use a literal space; a bracket expression ([...]) containing a space and a tab; or the [[:blank:]] character class. Strictly speaking, \s is equivalent to [[:space:]], which also includes vertical spacing characters such as CR, LF or VT, but you probably don't care about those.

The awk version assumes the lines don't start with a blank character. If a line does, awk still prints its first word, whereas the sed version leaves an empty line.
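
As a sketch of the portable fallback described above, the sed command can be written with [[:blank:]] in place of \s. The helper name first_field and the sample lines are made up for illustration; the last sample line starts with a tab to show how sed and awk differ in that case.

```shell
# Hypothetical helper: delete everything from the first space or tab onward.
first_field() {
    sed 's/[[:blank:]].*$//' "$@"
}

# Made-up sample: two ordinary lines and one that starts with a tab.
printf 'alpha beta gamma\nsolo\n\ttabbed first\n' | first_field

# For comparison, awk skips leading blanks before picking the first field.
printf '\ttabbed first\n' | awk '{print $1}'
```

The sed helper prints alpha, solo, and then an empty line for the tab-led input, while the awk command prints tabbed.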

How to remove duplicate lines that begin with a pattern and the next line after that

You can use getline in your awk to fetch the next line:

awk '/^>/{ if(!seen[$0]++){ print;getline;print } else { getline } }'

There is a simpler answer that also handles records spanning multiple lines:

awk '/^>/{ skip = seen[$0]++ }
     { if(!skip)print }'
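
To make the behavior concrete, here is a sketch that wraps the second awk command in a function and runs it on made-up records. The name dedupe_records and the sample data are assumptions for illustration; the awk logic is the same as above.

```shell
# Hypothetical helper wrapping the second awk command.
dedupe_records() {
    awk '/^>/ { skip = seen[$0]++ }   # header line: skip it if already seen
         { if (!skip) print }         # print lines of records not being skipped
    ' "$@"
}

# Made-up sample: ">id1" appears twice, ">id2" has a two-line body.
printf '%s\n' '>id1' 'AAAA' '>id2' 'CCCC' 'GGGG' '>id1' 'TTTT' | dedupe_records
```

The first >id1 record and the two-line >id2 record are printed; the second >id1 header and its TTTT line are dropped.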
