Text file containing filenames and hashes – extracting lines with duplicate hashes

Tags: duplicate, hashsum, text-processing

I have generated a large text file containing filenames and SHA-256 hashes in the format below, with a newline at the end of each line after the hash.

file_1.txt 8208ad321576b521b23b07b9ba598e5c43b03ec4172c96fdbd35a858ec205ae6

file_2.txt ee508a6e34a2383db1b177cb9527bed16ba72b47ceb4d33ab71b47a44c1d0c31

file_3.txt aaf6b8c4a95d0e8f191784943ba1ea5c0b4d4baab733efe8ceb8b35478b6afd2

When I say large – it's in the millions of lines – millions of hashes.

It took me quite a while to generate the hashes. Since the files span over 30 hard drives, using a duplicate file finding program is impossible; the filenames contain the drive on which each file is stored.

It's time to free up some disk space.

I want to DELETE the lines in the text file that have a unique hash that only occurs once.

I want to KEEP ALL the lines in the text file that have a hash that occurs twice or more.

Best Answer

You could do worse than this two-pass awk solution:

awk 'NR == FNR{if ($2 in a) b[$2]++;a[$2]++; next}; $2 in b' file file

In the first pass, array b keeps track of hash values that are encountered more than once. In the second pass, a record is printed if its hash exists in b.
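As a rough illustration (a sketch using a hypothetical test file named sample, with the hashes shortened to placeholders for readability), only the lines whose hash appears more than once survive:

$ cat sample
file_1.txt aaaa
file_2.txt bbbb
file_3.txt aaaa

$ awk 'NR == FNR{if ($2 in a) b[$2]++;a[$2]++; next}; $2 in b' sample sample
file_1.txt aaaa
file_3.txt aaaa

Redirect the output to a new file (e.g. > kept) and rename it afterwards if you want to replace the original file.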

Alternatively:

sort -k2,2 file | uniq -f 1 -D

which sorts the file by the second field and pipes the result to uniq to print all duplicate records, skipping the first field during comparison via -f 1. Given the size of your input file, this could turn out to be quite resource-intensive.
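If resources do become a problem, GNU sort lets you bound its memory use and point its temporary files at a roomier disk. A sketch, assuming GNU coreutils and a scratch directory /bigdisk/tmp (hypothetical path):

LC_ALL=C sort -k2,2 -S 2G -T /bigdisk/tmp file | uniq -f 1 -D > kept

Setting LC_ALL=C makes the byte-wise comparison faster, -S 2G caps the in-memory sort buffer at roughly 2 GiB, and -T spills sort's temporary chunks to the given directory; the kept file then contains only the duplicate-hash lines.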
