Text Processing – How to Delete Duplicate Lines Pairwise

sed, text-processing, uniq

I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it's nontrivial.

How can I delete all pairs of duplicate lines? In other words, if there is an even number of duplicates of a given line, delete all of them; if there is an odd number of duplicate lines, delete all but one. (Sorted input can be assumed.)

A clean, elegant solution is preferable.

Example input:

a
a
a
b
b
c
c
c
c
d
d
d
d
d
e

Example output:

a
d
e

Best Answer

I worked out the sed answer not long after I posted this question. No one else has used sed so far, so here it is:

sed '$!N;/^\(.*\)\n\1$/d;P;D'
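
For anyone puzzling over what the one-liner does, here is the same script spread over several lines with comments. The function name drop_pairs is mine, purely for illustration, and I have only checked this against GNU sed, so treat it as a sketch rather than a reference:

```shell
# drop_pairs is a hypothetical wrapper name; the sed commands are the one-liner above.
drop_pairs() {
    sed '
        # Unless this is the last line, append the next input line to the pattern space.
        $!N
        # Two identical lines: delete both and start a fresh cycle.
        /^\(.*\)\n\1$/d
        # Otherwise print the first of the two lines ...
        P
        # ... then drop it and restart the cycle with the remaining line.
        D
    ' "$@"
}

# The example input from the question.
printf '%s\n' a a a b b c c c c d d d d d e | drop_pairs
```

On the example input this prints a, d and e, one per line.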

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:

sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp

Extended to remove triples of lines:

sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp

Or to remove quads of lines:

sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp
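
The pattern in these commands is regular enough to generate: each extra line in the group adds one `.*\n` to the look-ahead test and one `\n\1` to the match. A rough sketch of that idea follows. drop_groups and its variable names are hypothetical helpers, not part of sed, and the sketch assumes n is at least 2 and that the input is sorted:

```shell
# Hypothetical helper: drop every run of n identical adjacent lines.
# Usage: drop_groups N [FILE...]
# The body runs in a subshell so its variables do not leak into the caller.
drop_groups() (
    n=$1
    shift
    ahead='\n'      # look-ahead test: are n lines already in the pattern space?
    same='\n\1'     # match: the first line repeated n-1 more times
    i=2
    while [ "$i" -lt "$n" ]; do
        ahead="$ahead.*\\n"
        same="$same\\n\\1"
        i=$((i + 1))
    done
    sed -e ':top' -e '$!{/'"$ahead"'/!{N;b top' -e '};};/^\(.*\)'"$same"'$/d;P;D' "$@"
)

# Triples removed from the question's example input.
printf '%s\n' a a a b b c c c c d d d d d e | drop_groups 3
```

With n set to 3, the example input comes out as b, b, c, d, d, e: the three a's vanish, one triple of c's and one triple of d's are removed, and everything else is left alone.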

sed has an additional advantage over most other options: it truly operates on a stream, holding no more lines in memory than the number being checked for duplicates (two for pairs, three for triples, and so on).

As cuonglm pointed out in the comments, setting the locale to C is necessary to avoid failures to properly remove lines containing multi-byte characters. So the commands above become:

LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.
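
A quick way to sanity-check the locale form on your own system is to feed it a few UTF-8 lines. The sample bytes (é, é, ü, written as octal escapes so the snippet itself stays ASCII) and the function name drop_pairs_c are made up for this demonstration. Whether the unprefixed command actually fails depends on your sed version and active locale, so the sketch only exercises the LC_ALL=C form:

```shell
# drop_pairs_c is a hypothetical wrapper around the LC_ALL=C command above.
drop_pairs_c() {
    LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' "$@"
}

# Three lines: e-acute twice, then u-umlaut once (UTF-8 bytes as octal escapes).
printf '\303\251\n\303\251\n\303\274\n' | drop_pairs_c
```

Only the last line (ü) should survive, since the two é lines form a pair.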

Related Solutions

Concatenate n lines with sed

sed -e:n -e$\bo -e'N;s/\n/&/4;to' -ebn -e:o -e'y/\n/ /' <in >out

That will concatenate 5 lines - or 1 + 4 lines - replacing each newline with a single space. However:

paste -d\  - - - - - <in >out

...would also work.
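
To compare the two on a small input, here is a sketch that wraps each command in a function. join5_sed and join5_paste are names I made up for the comparison. The commands inside are the ones above, except that `$bo` is written in single quotes, which the shell passes to sed the same way:

```shell
# Hypothetical wrappers around the two commands shown above.
join5_sed() {
    sed -e:n -e'$bo' -e'N;s/\n/&/4;to' -ebn -e:o -e'y/\n/ /'
}

join5_paste() {
    paste -d' ' - - - - -
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | join5_sed
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | join5_paste
# Both print:
# 1 2 3 4 5
# 6 7 8 9 10
```

The two differ when the line count is not a multiple of five: sed joins whatever is left over, while paste pads the final row with trailing delimiters.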

Your g sort thing could work like:

paste -d\  - - <input   |
sed 's/.*;\(.*\)/\1;&/' |
sort -t\; -k1,1         |
cut  -d\; -f2-          |
tr \  \\n

...which would be a fairly general way of doing it, though it relies on there being no spaces in the input file. It joins every two lines on a space, then copies the last ;-delimited field to the head of each line, sorts on the first field, then cuts it away and splits the lines back out.
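
Here is that pipeline run on made-up data so each stage is easier to follow. The function name sort_pairs and the sample records are hypothetical (I don't know what the real input looks like), and LC_ALL=C is added to sort only so that the order is predictable:

```shell
# Hypothetical wrapper: sort two-line records by the last ;-separated field.
sort_pairs() {
    paste -d' ' - - |               # join each pair of lines with a space
    sed 's/.*;\(.*\)/\1;&/' |       # copy the last ;-field to the front as a sort key
    LC_ALL=C sort -t';' -k1,1 |     # sort on that key
    cut -d';' -f2- |                # strip the key again
    tr ' ' '\n'                     # split each record back into two lines
}

# Made-up records: a label line followed by a line whose last field is the key.
printf '%s\n' 'rec1' 'x;y;zulu' 'rec2' 'x;y;alpha' | sort_pairs
# prints:
# rec2
# x;y;alpha
# rec1
# x;y;zulu
```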

Bash – Command line method to find repeat-word typos, with line numbers

You need to take care of at least some edge cases:

Repeated words may be split across lines, with one copy at the end of a line and the other at the beginning of the next.
The search should be case-insensitive, because of frequent errors like The the apple.
You probably want to restrict the search to word constituents, so that something like ( ( a + b) + c ) (repeated opening parentheses) does not match.
Only full words should match, so that the thesis is not reported.
For human language, Unicode characters inside words should be interpreted properly.

All in all, I recommend the pcregrep solution:

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

Obviously, color and line numbers (the n option) are optional, but usually nice to have.

Install

On Debian-based distributions you can install it via:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to see:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is just a text capture, but on a terminal that supports color, the matches are highlighted:

has has
and
and
be be