How to find duplicate lines in a text file, while some may be commented out or have different tokens at the beginning

text-processing, uniq

I have a text file with lines that are a mixture like this:

###  Comments

# Comments
86.242.200.81 banana.domain.net          # comment
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
#31.28.220.80 monkey.anotherdomain.net   # comment
86.242.201.3 orange.domain.net

How do I find the host.domain duplicates?

In this case, there are two: monkey.anotherdomain.net and orange.domain.net

Taking into account that:

  • Trailing comments after the entry need to be ignored, as they may not be present on the duplicate line.
  • If the line is commented out, the duplicate should still be found.
  • Differences in IP address should be ignored.

Best Answer

This was a fun one.

First, we need to eliminate trailing comments, as in:

86.242.200.81 banana.domain.net          # comment

We can do that with the following (assuming just spaces, no tabs):

sed 's/  *#.*//'
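
A quick sanity check on one of the sample lines:

$ echo '86.242.200.81 banana.domain.net          # comment' | sed 's/  *#.*//'
86.242.200.81 banana.domain.net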

If you have tabs in your hosts file, maybe run this first:

tr '\t' ' '
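
For example, on a made-up tab-separated line, the two commands chain like this:

$ printf '86.242.200.81\tbanana.domain.net\t# comment\n' | tr '\t' ' ' | sed 's/  *#.*//'
86.242.200.81 banana.domain.net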

Then we need to eliminate "comment out this line" comments, which I'm going to define as a single hash character preceding an IP address. We can remove those like this:

sed '/^#[0-9]/ s/^#//'
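
Checking against the sample lines, this un-comments the monkey entry but leaves a real comment line alone:

$ printf '#31.28.220.80 monkey.anotherdomain.net\n# Comments\n' | sed '/^#[0-9]/ s/^#//'
31.28.220.80 monkey.anotherdomain.net
# Comments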

Putting the above together gets us:

###  Comments

# Comments
86.242.200.81 banana.domain.net
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
31.28.220.80 monkey.anotherdomain.net
86.242.201.3 orange.domain.net
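
For reference, that listing comes from combining the pieces so far (assuming the input is in a file named hosts; blank lines are still present at this stage):

tr '\t' ' ' < hosts |
sed '
    /^#[0-9]/ s/^#//
    s/  *#.*//
'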

If we sort this on the second column (sort -k2), we get a list sorted by name (the two blank lines sort to the very top; they are not shown below):

86.242.200.81 banana.domain.net
# Comments
###  Comments
31.28.220.80 monkey.anotherdomain.net
31.28.225.81 monkey.anotherdomain.net
86.242.200.3 orange.domain.net
86.242.201.3 orange.domain.net
51.18.33.4 puffin.domainz.com

And now we can apply uniq to find duplicates, if we tell uniq to ignore the first field:

uniq -c -f 1

Which gives us:

  2 
  1 86.242.200.81 banana.domain.net
  1 # Comments
  1 ###  Comments
  2 31.28.220.80 monkey.anotherdomain.net
  2 86.242.200.3 orange.domain.net
  1 51.18.33.4 puffin.domainz.com
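
As an aside, the two comment header lines don't collapse into one even though the first field is ignored: uniq -f 1 compares everything after the skipped field, including the blanks in front of the second field, and those differ between "# Comments" and "###  Comments":

$ printf '# Comments\n###  Comments\n' | uniq -c -f 1
  1 # Comments
  1 ###  Comments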

So if we look for lines with a count of 2 or higher, we have found our duplicates. (The count of 2 on the empty line comes from the two blank lines in the input, which is why the script below deletes blank lines before sorting.) Putting this all together we get:

#!/bin/sh

tr '\t' ' ' |                 # normalize tabs to single spaces
sed '
    /^#[0-9]/ s/^#//
    s/  *#.*//
    /^ *$/ d
' |                           # un-comment entries, strip trailing comments, drop blank lines
sort -k2 |                    # sort on the host.domain field
uniq -f 1 -c |                # count adjacent duplicates, ignoring the IP field
awk '$1 > 1 {print}'          # keep only entries seen more than once

The final awk statement in the above script looks for lines where the count from uniq (field 1) is > 1.

Running the above script on the sample input looks like this (find-dups.sh and hosts are just placeholder names for the script and the input file):
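
$ sh find-dups.sh < hosts
  2 31.28.220.80 monkey.anotherdomain.net
  2 86.242.200.3 orange.domain.net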
