Text file containing filenames and hashes – extracting lines with duplicate hashes

Tags: duplicate, hashsum, text-processing

I have generated a large text file containing filenames and SHA-256 hashes in the format below, with a newline at the end of each line after the hash.

file_1.txt 8208ad321576b521b23b07b9ba598e5c43b03ec4172c96fdbd35a858ec205ae6

file_2.txt ee508a6e34a2383db1b177cb9527bed16ba72b47ceb4d33ab71b47a44c1d0c31

file_3.txt aaf6b8c4a95d0e8f191784943ba1ea5c0b4d4baab733efe8ceb8b35478b6afd2

When I say large – it's in the millions of lines – millions of hashes.

It took me quite a while to generate the hashes. Since the files span over 30 hard drives, using a duplicate file finding program is impossible; the filenames contain the drive on which each file is stored.

It's time to free up some disk space.

I want to DELETE the lines in the text file that have a unique hash that only occurs once.

I want to KEEP ALL the lines in the text file that have a hash that occurs twice or more.

Best Answer

You could do worse than this two-pass awk solution:

awk 'NR == FNR{if ($2 in a) b[$2]++;a[$2]++; next}; $2 in b' file file

In the first pass, array b keeps track of hash values that are encountered more than once. In the second pass, a record is printed if its hash exists in b.
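As a rough illustration (a sketch using a hypothetical test file named sample, with the hashes shortened to placeholders for readability), only the lines whose hash appears more than once survive:

$ cat sample
file_1.txt aaaa
file_2.txt bbbb
file_3.txt aaaa

$ awk 'NR == FNR{if ($2 in a) b[$2]++;a[$2]++; next}; $2 in b' sample sample
file_1.txt aaaa
file_3.txt aaaa

Redirect the output to a new file (e.g. > kept) and rename it afterwards if you want to replace the original file.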

Alternatively:

sort -k2,2 file | uniq -f 1 -D

which sorts the file by the second field and pipes the result to uniq to print all duplicate records, skipping the first field during comparison via -f 1. Given the size of your input file, this could turn out to be quite resource-intensive.
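If resources do become a problem, GNU sort lets you bound its memory use and point its temporary files at a roomier disk. A sketch, assuming GNU coreutils and a scratch directory /bigdisk/tmp (hypothetical path):

LC_ALL=C sort -k2,2 -S 2G -T /bigdisk/tmp file | uniq -f 1 -D > kept

Setting LC_ALL=C makes the byte-wise comparison faster, -S 2G caps the in-memory sort buffer at roughly 2 GiB, and -T spills sort's temporary chunks to the given directory; the kept file then contains only the duplicate-hash lines.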
