Text Processing – How to Delete Duplicate Lines Pairwise

Tags: sed, text-processing, uniq

I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it's nontrivial.

How can I delete all pairs of duplicate lines? In other words, if a given line occurs an even number of times, delete every copy; if it occurs an odd number of times, delete all but one. (Sorted input can be assumed.)

A clean, elegant solution is preferable.

Example input:

a
a
a
b
b
c
c
c
c
d
d
d
d
d
e

Example output:

a
d
e

Best Answer

I worked out the sed answer not long after I posted this question. No one else has used sed so far, so here it is:

sed '$!N;/^\(.*\)\n\1$/d;P;D'
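To see why the one-liner works, it can help to spell it out as a commented script. The sketch below wraps the same commands in a shell function (the name dedupe_pairs is made up for illustration) and runs it on the example input from the question. It assumes a sed that accepts comments and leading blanks inside a script, as GNU and BSD sed do.

```shell
# Hypothetical wrapper around the one-liner above; reads files or stdin.
dedupe_pairs() {
    sed '
        # Unless this is the last line, append the next input line,
        # so the pattern space holds "line1\nline2".
        $!N
        # Two identical lines: delete both and start a new cycle.
        /^\(.*\)\n\1$/d
        # Otherwise print the first line only...
        P
        # ...drop it, and rerun the script on what is left
        # without reading a fresh line.
        D
    ' "$@"
}

# Example input from the question; prints a, d and e on separate lines.
printf '%s\n' a a a b b c c c c d d d d d e | dedupe_pairs
```

The P;D pair is what keeps an unmatched second line in play: it becomes the first line of the next comparison instead of being printed and forgotten.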

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:

sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp

Extended to remove triples of lines:

sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp

Or to remove quads of lines:

sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp
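The three commands above differ only in how many newlines they wait for and how many back-references they compare, so the script can be generated for any group size. This is a minimal sketch, not part of the original answer: the function name drop_groups and its argument order are assumptions, and it expects a group size of 2 or more.

```shell
# Hypothetical helper: drop_groups N [file...]
# Deletes every run of N identical adjacent lines, using the same
# sed script as above with the two regexes built in a loop.
# The body runs in a subshell so the helper variables stay local.
drop_groups() (
    n=$1
    shift
    need='\n'            # true once n-1 newlines are buffered
    match='\(.*\)\n\1'   # n identical buffered lines
    i=2
    while [ "$i" -lt "$n" ]; do
        need="$need"'.*\n'
        match="$match"'\n\1'
        i=$((i + 1))
    done
    sed -e ':top' -e '$!{/'"$need"'/!{N;b top' -e '};};/^'"$match"'$/d;P;D' "$@"
)

# Triples removed from the question's example input:
# prints b, b, c, d, d, e on separate lines.
printf '%s\n' a a a b b c c c c d d d d d e | drop_groups 3
```

With N set to 2 the loop body never runs and the generated script is identical to the pair version shown earlier.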

sed has an additional advantage over most other options: it truly operates on a stream, holding in memory only as many lines as are being compared at once (two for pairs, three for triples, and so on).
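As a rough illustration of the streaming behaviour, the pair command can be fed a long generated input through a pipe. The sketch below is only a sanity check under stated assumptions: the function name odd_run_survivor and the line count of 1,000,001 are arbitrary, and it shows that the result is correct for a long run, not how much memory sed uses.

```shell
# Hypothetical check: 1,000,001 identical lines is an odd count,
# so exactly one copy should survive.
odd_run_survivor() {
    awk 'BEGIN { for (i = 0; i < 1000001; i++) print "y" }' |
        sed '$!N;/^\(.*\)\n\1$/d;P;D'
}

# Prints a single "y".
odd_run_survivor
```

Because each matched pair is deleted before the next line is read, the pattern space never grows beyond two lines, regardless of input length.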


As cuonglm pointed out in the comments, the locale must be set to C; otherwise lines containing multi-byte characters may not be removed correctly. So the commands above become:

LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.