I have a folder with duplicate files (by md5sum; md5 on a Mac), and I want to have a cron job scheduled to remove any found.
However, I'm stuck on how to do this. What I have so far:
md5 -r * | sort
Which outputs something like this:
04c5d52b7acdfbecd5f3bdd8a39bf8fb gordondam_en-au11915031300_1366x768.jpg
1e88c6899920d2c192897c886e764fc2 fortbourtange_zh-cn9788197909_1366x768.jpg
266ea304b15bf4a5650f95cf385b16de nebraskasupercell_fr-fr11286079811_1366x768.jpg
324735b755c40d332213899fa545c463 grossescheidegg_en-us10868142387_1366x768.jpg
3993028fcea692328e097de50b26f540 Soyuz Spacecraft Rolled Out For Launch of One Year Crew.png
677bcd6006a305f4601bfb27699403b0 lechaustria_zh-cn7190263094_1366x768.jpg
80d03451b88ec29bff7d48f292a25ce6 ontariosunrise_en-ca10284703762_1366x768.jpg
b6d9d24531bc62d2a26244d24624c4b1 manateeday_row10617199289_1366x768.jpg
ca1486dbdb31ef6af83e5a40809ec561 Grueling Coursework.jpg
cdf26393577ac2a61b6ce85d22daed24 Star trails over Mauna Kea.jpg
dc3ad6658d8f8155c74054991910f39c smoocave_en-au10358472670_1366x768.jpg
dc3ad6658d8f8155c74054991910f39c smoocave_en-au10358472670_1366x7682.jpg
How can I process based on the MD5 of the file to remove duplicates? I don't really care which "original" I keep – but I only want to keep one.
Should I be approaching this in a different manner?
Best Answer
I'm working on Linux, which means the command is md5sum, which outputs one line per file, with the hash first and the filename after it.
Now, using awk and xargs, the command would be:
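A minimal sketch of such a pipeline, reconstructed from the explanation that follows and not necessarily the exact original command. It assumes GNU md5sum and GNU xargs (for the -r flag) and file names without whitespace, since awk's $2 only holds the first word of a name. The scratch directory and the sample file names are hypothetical, there only so the demo cannot delete anything real.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is deleted.
export LC_ALL=C                 # predictable sort order
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'aaa\n' > one.jpg
printf 'aaa\n' > one_copy.jpg   # same content as one.jpg
printf 'bbb\n' > two.jpg

# sort puts identical hashes on adjacent lines; awk prints the file name of
# every line whose hash equals the previous line's hash; xargs hands those
# names to rm. -r (GNU xargs) skips rm when there is nothing to delete.
md5sum * | sort \
  | awk 'BEGIN { lasthash = "" } $1 == lasthash { print $2 } { lasthash = $1 }' \
  | xargs -r rm --

ls   # one file per distinct hash is left
```

Replacing `xargs -r rm --` with `xargs -r echo rm --` gives a dry run that only prints what would be removed, which seems a sensible first step before putting anything like this into cron.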
The awk part initializes lasthash with the empty string, which will not match any hash, and then checks for each line whether the hash in lasthash is the same as the hash (first column) of the current file (second column). If it is, it prints the filename. At the end of every step it sets lasthash to the hash of the current file (you could limit this to only be set if the hashes are different, but that should be a minor thing, especially if you do not have many matching files). The filenames awk prints are fed to rm with xargs, which basically calls rm with what the awk part gives us.
You probably need to filter out directories before md5sum *.
Edit:
Using Marcin's method you could also use this one:
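One way this could look, as a sketch rather than the original command: it uses comm -23 on two sorted lists kept in temporary files, so it runs in plain sh. It assumes GNU md5sum and GNU uniq (for -w) and file names without whitespace. The scratch directories and sample file names are hypothetical.

```shell
#!/bin/sh
export LC_ALL=C                 # comm needs both lists in the same sort order
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'aaa\n' > one.jpg
printf 'aaa\n' > one_copy.jpg   # same content as one.jpg
printf 'bbb\n' > two.jpg

lists=$(mktemp -d)              # helper lists live outside the scanned folder

# Every file in the folder.
ls | sort > "$lists/all"
# The first file name for each unique hash (uniq compares the 32 hash chars).
md5sum * | sort -k1 | uniq -w 32 | awk '{print $2}' | sort > "$lists/keep"

# comm -23 prints the lines found only in the first list: the duplicates.
comm -23 "$lists/all" "$lists/keep" | xargs -r rm --

ls
```

Keeping the two lists in a separate directory avoids them showing up in the ls output and being hashed themselves.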
This subtracts, from the file list obtained by ls, the first filename of each unique hash obtained by md5sum * | sort -k1 | uniq -w 32 | awk '{print $2}'.