Remove adjacent duplicate lines while keeping the order

awk sed sort uniq

I have a one-column file of names, each repeated several times in a row. I want to collapse each run of adjacent repeats into a single line, while keeping later runs of the same name that are not adjacent to the earlier ones.

E.g. I want to turn the left column into the right column:

Golgb1    Golgb1    
Golgb1    Akna
Golgb1    Spata20
Golgb1    Golgb1
Golgb1    Akna
Akna
Akna
Akna
Spata20
Spata20
Spata20
Golgb1
Golgb1
Golgb1
Akna
Akna
Akna

This is what I've been using: perl -ne 'print if ++$k{$_}==1' file.txt > file2.txt
However, this keeps only the first occurrence of each name across the whole file, so the later Golgb1 and Akna blocks are dropped entirely.

Is there a way to keep unique names for each block, while keeping names that repeat in multiple, non-adjacent blocks?

Best Answer

uniq will do this for you, since it only removes duplicate lines that are adjacent:

$ uniq inputfile
Golgb1
Akna
Spata20
Golgb1
Akna
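
Since the question is also tagged awk, here is a minimal sketch of the same adjacent-only behaviour in awk. The function name squeeze_adjacent is hypothetical and only used for illustration; plain uniq remains the simpler choice.

```shell
# Hypothetical helper: print a line only when it differs from the line before it.
squeeze_adjacent() {
    # NR == 1 always prints the first line, even if it is empty.
    # Appending "" forces a string comparison, so "1" and "1.0" stay distinct.
    awk 'NR == 1 || ($0 "") != prev { print } { prev = $0 }' "$@"
}

# prints Golgb1, Akna, Golgb1 (one per line)
printf '%s\n' Golgb1 Golgb1 Akna Akna Golgb1 | squeeze_adjacent
```

Like uniq, it reads a file argument or standard input, e.g. squeeze_adjacent file.txt > file2.txt.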

Related Solutions

AWK – How to Remove Duplicate Lines While Keeping Empty Lines

Another option is to check NF, which is 0 for empty lines, e.g.:

awk '!NF || !seen[$0]++'
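
A small sketch of how that one-liner behaves on sample input. The wrapper name dedupe_keep_blank is hypothetical. Note that NF is also 0 for lines containing only whitespace, so those are kept as well.

```shell
# Hypothetical wrapper around the one-liner above.
dedupe_keep_blank() {
    # !NF          -> always print lines with no fields (empty or whitespace-only)
    # !seen[$0]++  -> otherwise print a line only the first time it is seen
    awk '!NF || !seen[$0]++' "$@"
}

# prints: aa, an empty line, another empty line, bb (the second "aa" is dropped)
printf 'aa\n\naa\n\nbb\n' | dedupe_keep_blank
```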

Remove duplicate lines while keeping the order of the lines

I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:

perl -ne 'print if ++$k{$_}==1' out.txt

If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:

cat -n out.txt | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-

How it works:

On a GNU system, cat -n will prepend the line number to each line following some amount of spaces and followed by a <tab> character. cat pipes this input representation to sort.
sort's -k2 option instructs it to consider only the characters from the second field to the end of the line when sorting; by default sort splits fields on whitespace (here, the spaces and <tab> inserted by cat).
When followed by -k1n, sort compares the 2nd field first and then, for identical -k2 fields, the 1st field, sorted numerically. So repeated lines are grouped together, in the order in which they appeared.
The results are piped to uniq, which is told to ignore the first field (-f1, with fields also separated by whitespace). This yields one copy of each unique line in the original file, which is piped back to sort.
This time sort sorts on the first field (cat's inserted line number) numerically, getting the sort order back to what it was in the original file and pipes these results to cut.
Lastly, cut removes the line numbers that were inserted by cat. This is effected by cut printing only from the 2nd field through the end of the line (and cut's default delimiter is a <tab> character).

To illustrate:

$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
bb
aa    
dd
cc
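
For repeated use, the pipeline can be wrapped in a shell function. This is only a sketch: the name dedupe_keep_order is hypothetical, and it has not been checked against lines that differ only in leading whitespace.

```shell
# Hypothetical wrapper: keep the first occurrence of every line, in original order.
dedupe_keep_order() {
    cat -n "$@" |        # prefix each line with its line number and a tab
        sort -k2 -k1n |  # group identical lines, earliest occurrence first
        uniq -f1 |       # ignore the number field; keep one line per group
        sort -nk1,1 |    # restore the original order by line number
        cut -f2-         # strip the line numbers again
}

# prints bb, aa, dd, cc (one per line)
printf '%s\n' bb aa bb dd cc dd aa bb cc | dedupe_keep_order
```

Unlike the perl and awk approaches, this does not hold every unique line in memory at once; sort can spill to temporary files for large inputs.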
