I have generated a large text file containing filenames and SHA-256 hashes, one per line in the format below (each line ends with a newline after the hash).
file_1.txt 8208ad321576b521b23b07b9ba598e5c43b03ec4172c96fdbd35a858ec205ae6
file_2.txt ee508a6e34a2383db1b177cb9527bed16ba72b47ceb4d33ab71b47a44c1d0c31
file_3.txt aaf6b8c4a95d0e8f191784943ba1ea5c0b4d4baab733efe8ceb8b35478b6afd2
When I say large, I mean millions of lines and millions of hashes.
It took me quite a while to generate the hashes. Since the files span over 30 hard drives, using a duplicate file finding program is impossible; instead, each filename records the drive on which the file is stored.
It's time to free up some disk space.
I want to DELETE the lines in the text file that have a unique hash that only occurs once.
I want to KEEP ALL the lines in the text file that have a hash that occurs twice or more.
Best Answer
You could do worse than this two-pass awk solution. In the first pass, use an array b to keep track of hash values that are encountered more than once. In the second pass, print a record if its hash exists within b.
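The awk command itself did not survive here, but based on that description it would look something like the sketch below (hashes.txt is a stand-in name for your list file):

```shell
# Pass 1 (NR==FNR, first read of hashes.txt):
#   a[] remembers every hash seen so far; b[] gets an entry only when a
#   hash shows up a second (or later) time.
# Pass 2 (second read of hashes.txt):
#   print a line only if its hash ($2) landed in b, i.e. it occurs twice or more.
awk 'NR==FNR { if ($2 in a) b[$2]; a[$2]; next } $2 in b' hashes.txt hashes.txt
```

Redirect the output to a new file (e.g. `> keep.txt`) rather than overwriting hashes.txt in place, since awk reads the same file twice.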
Alternately, you can sort the file by the second field and pipe the result to uniq, printing all duplicate records while skipping the first field during comparison (via -f 1). Given the size of your input file, this could turn out to be quite resource-intensive.
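That pipeline would look roughly like this (a sketch assuming GNU uniq, whose -D option prints every member of each duplicate group; -D is not in POSIX):

```shell
# Sort on the hash (field 2), then keep only lines whose hash repeats.
# uniq -f 1 skips the first (filename) field when comparing adjacent lines;
# -D prints all members of every group of duplicates.
sort -k 2 hashes.txt | uniq -f 1 -D
```

For a multi-million-line file, sort will spill to temporary files on disk; with GNU sort you can raise its in-memory buffer with -S if that becomes a bottleneck.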