I have a text file with lines that are a mixture like this:
### Comments
# Comments
86.242.200.81 banana.domain.net # comment
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net
51.18.33.4 puffin.domainz.com
#31.28.220.80 monkey.anotherdomain.net # comment
86.242.201.3 orange.domain.net
How do I find the host.domain duplicates?
In this case, there are two: monkey.anotherdomain.net
and orange.domain.net
Taking into account that:
- Trailing comments after the entry need to be ignored, as they may not be present on the duplicate line.
- If the line is commented out, the duplicate should still be found.
- Differences in IP address should be ignored.
Best Answer
This was a fun one.
First, we need to eliminate trailing comments, such as the "# comment" after banana.domain.net in the example above.
We can do that with the following (assuming just spaces, no tabs):
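A minimal sketch (assuming a sed that supports -E, and using a sample line from the question as input):

```shell
# Delete from the first space-preceded "#" to the end of the line.
# Requiring at least one space before the "#" means a comment-out
# hash at the very start of a line is left untouched.
printf '86.242.200.81 banana.domain.net # comment\n' |
  sed -E 's/ +#.*$//'
```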
If you have tabs in your hosts file, maybe run this first:
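For example, tr is one way to do it (the tab-separated input line here is hypothetical):

```shell
# Turn each tab into a single space so the space-based sed
# patterns used in the other steps still match.
printf '86.242.200.81\tbanana.domain.net\n' | tr '\t' ' '
```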
Then we need to eliminate "comment out this line" comments, which I'm going to define as a single hash character immediately preceding an IP address. We can remove those like this:
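A sketch of that substitution, again on a line taken from the question:

```shell
# Strip a leading "#" only when it is immediately followed by a digit,
# so prose comments such as "# Comments" survive unchanged.
printf '#31.28.220.80 monkey.anotherdomain.net\n' |
  sed -E 's/^#([0-9])/\1/'
```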
Putting the above together gets us:
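Both substitutions in one sed call. I've also added a middle step of my own (not strictly required by the text) that deletes the remaining pure comment lines, so "# Comments" and "### Comments" cannot later be miscounted as duplicates of each other:

```shell
# Order matters: first un-comment "#<ip>" lines, then delete the
# remaining pure comment lines, then strip trailing comments.
printf '%s\n' \
  '### Comments' \
  '86.242.200.81 banana.domain.net # comment' \
  '#31.28.220.80 monkey.anotherdomain.net # comment' |
  sed -E -e 's/^#([0-9])/\1/' -e '/^#/d' -e 's/ +#.*$//'
```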
If we sort this on the second column (sort -k2), we get a list sorted by name. Now we can apply uniq to find duplicates, if we tell uniq to ignore the first field (-f1) and count repeats (-c), which gives us each name prefixed by the number of times it appears.
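Sketched on a three-line sample: -c counts each run of identical lines, and -f1 makes uniq skip the first (IP) field when comparing:

```shell
printf '%s\n' \
  '86.242.200.3 orange.domain.net' \
  '31.28.225.81 monkey.anotherdomain.net' \
  '86.242.201.3 orange.domain.net' |
  sort -k2 |     # order by hostname so duplicates become adjacent
  uniq -c -f1    # count runs, ignoring the differing IP field
```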
So if we look for lines with a count of 2 or higher, we have found our duplicates. Putting this all together we get:
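Here is the whole pipeline as a script. This is a sketch: the here-document feeds in the question's sample data so the script is self-contained, and in practice you would read your real hosts file instead; a sed with -E support is assumed, and the comment-line deletion step is my addition:

```shell
#!/bin/sh
# Find duplicated host names in a hosts-style file, even when one
# copy is commented out or carries a trailing comment.
find_dups() {
  tr '\t' ' ' |                    # normalise tabs to spaces
    sed -E -e 's/^#([0-9])/\1/' -e '/^#/d' -e 's/ +#.*$//' |  # clean comments
    sort -k2 |                     # group identical names together
    uniq -c -f1 |                  # count runs, skipping the IP field
    awk '$1 > 1'                   # keep names that appear more than once
}

find_dups <<'EOF'
### Comments
# Comments
86.242.200.81 banana.domain.net # comment
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net
51.18.33.4 puffin.domainz.com
#31.28.220.80 monkey.anotherdomain.net # comment
86.242.201.3 orange.domain.net
EOF
```

On this sample, the script reports the two duplicated names, monkey.anotherdomain.net and orange.domain.net, each with a count of 2.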
The final awk statement in the above script looks for lines where the count from uniq (field 1) is > 1. Running the script on the sample file reports the two duplicates, monkey.anotherdomain.net and orange.domain.net.