How to find duplicate lines in a text file, while some may be commented out or have different tokens at the beginning

text-processing, uniq

I have a text file with lines that are a mixture like this:

###  Comments

# Comments
86.242.200.81 banana.domain.net          # comment
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
#31.28.220.80 monkey.anotherdomain.net   # comment
86.242.201.3 orange.domain.net

How do I find the host.domain duplicates?

In this case, there are two: monkey.anotherdomain.net and orange.domain.net

Taking into account that:

  • Trailing comments after the entry need to be ignored, as they may not be present on the duplicate line.
  • If the line is commented out, the duplicate should still be found.
  • Differences in IP address should be ignored.

Best Answer

This was a fun one.

First, we need to eliminate trailing comments, as in:

86.242.200.81 banana.domain.net          # comment

We can do that with the following (assuming just spaces, no tabs):

sed 's/  *#.*//'
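
A quick sanity check on one of the sample lines:

$ echo '86.242.200.81 banana.domain.net          # comment' | sed 's/  *#.*//'
86.242.200.81 banana.domain.net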

If you have tabs in your hosts file, maybe run this first:

tr '\t' ' '
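
For example, on a made-up tab-separated line, the two commands chain like this:

$ printf '86.242.200.81\tbanana.domain.net\t# comment\n' | tr '\t' ' ' | sed 's/  *#.*//'
86.242.200.81 banana.domain.net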

Then we need to eliminate "comment out this line" comments, which I'm going to define as a single hash character preceding an IP address. We can remove those like this:

sed '/^#[0-9]/ s/^#//'
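
Checking against the sample lines, this un-comments the monkey entry but leaves a real comment line alone:

$ printf '#31.28.220.80 monkey.anotherdomain.net\n# Comments\n' | sed '/^#[0-9]/ s/^#//'
31.28.220.80 monkey.anotherdomain.net
# Comments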

Putting the above together gets us:

###  Comments

# Comments
86.242.200.81 banana.domain.net
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
31.28.220.80 monkey.anotherdomain.net
86.242.201.3 orange.domain.net
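
For reference, that listing comes from combining the pieces so far (assuming the input is in a file named hosts; blank lines are still present at this stage):

tr '\t' ' ' < hosts |
sed '
    /^#[0-9]/ s/^#//
    s/  *#.*//
'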

If we sort this on the second column (sort -k2), we get a list sorted by name (the two blank lines sort to the very top; they are not shown below):

86.242.200.81 banana.domain.net
# Comments
###  Comments
31.28.220.80 monkey.anotherdomain.net
31.28.225.81 monkey.anotherdomain.net
86.242.200.3 orange.domain.net
86.242.201.3 orange.domain.net
51.18.33.4 puffin.domainz.com

And now we can apply uniq to find duplicates, if we tell uniq to ignore the first field:

uniq -c -f 1

Which gives us:

  2 
  1 86.242.200.81 banana.domain.net
  1 # Comments
  1 ###  Comments
  2 31.28.220.80 monkey.anotherdomain.net
  2 86.242.200.3 orange.domain.net
  1 51.18.33.4 puffin.domainz.com
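
As an aside, the two comment header lines don't collapse into one even though the first field is ignored: uniq -f 1 compares everything after the skipped field, including the blanks in front of the second field, and those differ between "# Comments" and "###  Comments":

$ printf '# Comments\n###  Comments\n' | uniq -c -f 1
  1 # Comments
  1 ###  Comments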

So if we look for lines with a count of 2 or higher, we have found our duplicates. (The count of 2 on the empty line comes from the two blank lines in the input, which is why the script below deletes blank lines before sorting.) Putting this all together we get:

#!/bin/sh

tr '\t' ' ' |                 # normalize tabs to single spaces
sed '
    /^#[0-9]/ s/^#//
    s/  *#.*//
    /^ *$/ d
' |                           # un-comment entries, strip trailing comments, drop blank lines
sort -k2 |                    # sort on the host.domain field
uniq -f 1 -c |                # count adjacent duplicates, ignoring the IP field
awk '$1 > 1 {print}'          # keep only entries seen more than once

The final awk statement in the above script looks for lines where the count from uniq (field 1) is > 1.

Running the above script on the sample input looks like this (find-dups.sh and hosts are just placeholder names for the script and the input file):
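
$ sh find-dups.sh < hosts
  2 31.28.220.80 monkey.anotherdomain.net
  2 86.242.200.3 orange.domain.net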
