I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it's nontrivial.
How can I delete all pairs of duplicate lines? In other words, if there is an even number of duplicates of a given line, delete all of them; if there is an odd number of duplicate lines, delete all but one. (Sorted input can be assumed.)
A clean, elegant solution is preferable.
Example input:
a
a
a
b
b
c
c
c
c
d
d
d
d
d
e
Example output:
a
d
e
Best Answer
I worked out the sed answer not long after I posted this question; no one else has used sed so far, so here it is:

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:
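A minimal sketch of commands of this kind, assuming GNU sed (the -e splitting around the label is there so other POSIX seds should accept it too). The function names drop_pairs and drop_pairs_loop are hypothetical wrappers, and the commands may differ in detail from the ones originally posted:

```shell
# Plain form: join each line with the next one, delete the pair if both
# halves are identical, otherwise print the first line and slide forward.
drop_pairs() {
    sed '$!N;/^\(.*\)\n\1$/d;P;D' "$@"
}

# Same logic with an explicit read loop: keep appending lines until the
# pattern space holds two lines (one newline), then test for a duplicate pair.
drop_pairs_loop() {
    sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' "$@"
}

printf 'a\na\na\nb\nb\n' | drop_pairs        # prints: a
printf 'a\na\na\nb\nb\n' | drop_pairs_loop   # prints: a
```

In the loop form, the group size is controlled by two things: the /\n/ test that decides when enough lines have been read, and the number of \1 back-references in the match.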
Extended to remove triples of lines:
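As a sketch under the same assumptions (GNU sed; drop_triples is a hypothetical wrapper name), the triple-removing form reads until two newlines are present and matches three identical lines:

```shell
# Read until the pattern space holds three lines (two newlines), delete them
# if all three are identical, otherwise print the first and slide forward.
drop_triples() {
    sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' "$@"
}

printf 'a\na\na\nb\nb\nc\nc\nc\nc\n' | drop_triples   # prints: b, b, c (one per line)
```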
Or to remove quads of lines:
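A sketch of the four-line variant, again assuming GNU sed and using the hypothetical wrapper name drop_quads; only the newline count in the read loop and the number of back-references change:

```shell
# Read until the pattern space holds four lines (three newlines), delete them
# if all four are identical, otherwise print the first and slide forward.
drop_quads() {
    sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' "$@"
}

printf 'c\nc\nc\nc\nd\nd\nd\nd\nd\ne\n' | drop_quads   # prints: d, e (one per line)
```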
sed has an additional advantage over most other options, which is its ability to truly operate in a stream, with no more memory storage needed than the actual number of lines to be checked for duplicates.

As cuonglm pointed out in the comments, setting the locale to C is necessary to avoid failures to properly remove lines containing multi-byte characters. So the commands above become:
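A sketch of what the locale-pinned forms might look like, assuming GNU sed; drop_pairs_c and drop_triples_c are hypothetical wrapper names. Prefixing the command with LC_ALL=C makes sed compare raw bytes rather than decoding characters:

```shell
# Pair removal with the locale forced to C for this one sed invocation.
drop_pairs_c() {
    LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' "$@"
}

# Triple removal, same locale override.
drop_triples_c() {
    LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' "$@"
}

printf 'é\né\nx\n' | drop_pairs_c   # prints: x
```

The variable assignment applies only to that sed process, so the rest of the shell session keeps its own locale.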