Check all lines of a file are unique

text processing

I have a text file containing lines like this:

This is a thread  139737522087680
This is a thread  139737513694976
This is a thread  139737505302272
This is a thread  139737312270080
.
.
.
This is a thread  139737203164928
This is a thread  139737194772224
This is a thread  139737186379520

How can I be sure of the uniqueness of every line?

NOTE: The goal is to test the file, not to modify it if duplicate lines are present.

Best Answer

[ "$(wc -l < input)" -eq "$(sort -u input | wc -l)" ] && echo all unique
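
If the check fails, it usually helps to see which lines repeat. As a minimal sketch (the sample file and its contents are made up for illustration), sort piped into uniq -d prints one copy of each duplicated line and leaves the file untouched:

```shell
# Hypothetical sample file with one repeated line
input=$(mktemp)
printf '%s\n' \
    'This is a thread  1' \
    'This is a thread  2' \
    'This is a thread  1' > "$input"

# uniq -d prints only lines that occur more than once (input must be sorted)
dupes=$(sort "$input" | uniq -d)

if [ -z "$dupes" ]; then
    echo "all unique"
else
    echo "duplicates found:"
    printf '%s\n' "$dupes"
fi

rm -f "$input"
```

An empty result means every line is unique, which matches the wc -l comparison above.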

Related Solutions

How to find duplicate lines in a text file, while some may be commented out or have different tokens at the beginning

This was a fun one.

First, we need to eliminate trailing comments, as in:

86.242.200.81 banana.domain.net          # comment

We can do that with the following (assuming just spaces, no tabs):

sed 's/  *#.*//'

If you have tabs in your hosts file, maybe run this first:

tr '\t' ' '

Then we need to uncomment lines that were "commented out", which I'm going to define as a single hash character immediately followed by an IP address (that is, a digit). We can strip that leading hash like this:

sed '/^#[0-9]/ s/^#//'

Putting the above together gets us:

###  Comments

# Comments
86.242.200.81 banana.domain.net
86.242.200.3 orange.domain.net
31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
31.28.220.80 monkey.anotherdomain.net
86.242.201.3 orange.domain.net

If we sort this on the second column (sort -k2), we get a list sorted by name:

86.242.200.81 banana.domain.net
# Comments
###  Comments
31.28.220.80 monkey.anotherdomain.net
31.28.225.81 monkey.anotherdomain.net
86.242.200.3 orange.domain.net
86.242.201.3 orange.domain.net
51.18.33.4 puffin.domainz.com

And now we can apply uniq to find duplicates, if we tell uniq to ignore the first field:

uniq -c -f 1

Which gives us:

  2 
  1 86.242.200.81 banana.domain.net
  1 # Comments
  1 ###  Comments
  2 31.28.220.80 monkey.anotherdomain.net
  2 86.242.200.3 orange.domain.net
  1 51.18.33.4 puffin.domainz.com

So if we look for lines with a count of 2 or higher, we have found our duplicates. Putting this all together we get:

#!/bin/sh

tr '\t' ' ' |
sed '
    /^#[0-9]/ s/^#//
    s/  *#.*//
    /^ *$/ d
' |
sort -k2 |
uniq -f 1 -c |
awk '$1 > 1 {print}'

The final awk statement in the above script keeps only the lines where the count from uniq (field 1, $1) is greater than 1.

Running the above script looks like this.
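
As a minimal sketch of such a run, the pipeline can be wrapped in a function and fed a small sample. The function name find_dupes and the hosts entries below are made up for illustration; they are not the original input.

```shell
# find_dupes is a hypothetical wrapper around the pipeline above; it reads stdin
find_dupes() {
    tr '\t' ' ' |
    sed '
        /^#[0-9]/ s/^#//
        s/  *#.*//
        /^ *$/ d
    ' |
    sort -k2 |
    uniq -f 1 -c |
    awk '$1 > 1 {print}'
}

# Made-up sample: one commented-out entry, one trailing comment, one blank line
dupes=$(find_dupes <<'EOF'
### Sample hosts file (made up for illustration)
86.242.200.81 banana.domain.net    # trailing comment
86.242.200.3 orange.domain.net
#31.28.225.81 monkey.anotherdomain.net

51.18.33.4 puffin.domainz.com
31.28.220.80 monkey.anotherdomain.net
86.242.201.3 orange.domain.net
EOF
)

printf '%s\n' "$dupes"
```

With this sample, the two host names that appear twice (monkey.anotherdomain.net and orange.domain.net) should each be reported with a count of 2. Which of their IP addresses is printed depends on how sort orders the tied lines.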

How to edit the last n lines in a file

To replace commas with semicolons on the last n lines with ed:

n=3
ed -s input <<< '$-'$((n-1))$',$s/,/;/g\nwq'

Splitting that apart:

ed -s = run ed silently (don't report the bytes written at the end)
'$-' = from the end of the file ($) minus ...
$((n-1)) = n-1 lines ...
$' ... ' = ANSI-C quoting (a bash feature, like the <<< here-string): it protects the rest of the command from the shell and turns \n into a real newline
,$s/,/;/g = ... until the end of the file (,$), search and replace all commas with semicolons.
\nwq = end the previous command, then save and quit

To replace commas with semicolons on the last n lines with sed:

n=3
sed -i "$(( $(wc -l < input) - n + 1)),\$s/,/;/g" input

Breaking that apart:

-i = edit the file "in-place"
$(( ... )) = do some math:
$( wc -l < input) = get the number of lines in the file
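- n + 1 = subtract n and add 1, which gives the line number of the first of the last n lines
,\$ = from that line until the end of the file (the $ is escaped because the expression is inside double quotes):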
s/,/;/g = replace the commas with semicolons.
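
The line arithmetic above can be checked without touching the file. The sketch below drops -i so the result goes to standard output, and clamps the start line to 1 so that an n larger than the file does not produce an invalid address. The helper name replace_in_last_n and the sample data are made up for illustration.

```shell
# replace_in_last_n FILE N: hypothetical helper; prints FILE with commas
# replaced by semicolons on its last N lines (the file is not modified)
replace_in_last_n() {
    file=$1
    n=$2
    total=$(wc -l < "$file")
    start=$(( total - n + 1 ))
    # If N exceeds the line count, start at line 1 instead of 0 or below
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    sed "${start},\$s/,/;/g" "$file"
}

# Made-up five-line sample
sample=$(mktemp)
printf 'a,1\nb,2\nc,3\nd,4\ne,5\n' > "$sample"

result=$(replace_in_last_n "$sample" 3)
printf '%s\n' "$result"

rm -f "$sample"
```

Here total is 5 and n is 3, so the address range is 3,$ and only the last three lines change.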
