Ubuntu – Find & delete duplicate files on multiple hard disks at once


I have 4 hard disks and want to find out which files on these four hard disks (including subdirectories) are duplicates. They should be checked not only within a single hard disk but against all the others as well.

The hard disks are large (3 TB), so it has to be efficient (for example first file name comparisons, then checksum checks or similar).

Best Answer

The script below looks for duplicate files in up to 10 directories at once, treating the combined directories as one set.

It should be dramatically faster than both fdupes (running fdupes -r) and fslint: on a relatively small, locally stored directory of 40 GB, the script took 5 seconds to create a list of duplicates, while fdupes and fslint took much longer (roughly 90 and 100 seconds). On a larger directory (700 GB, ~350,000 files) on a relatively slow external USB drive, it took 90 minutes. Judging from the progress indication of fdupes (which is nice; the script below doesn't show progress), fdupes would have taken well over 200-250 minutes, but I didn't wait for it to finish.
I should mention that fslint, for example, offers additional functionality which the script (as it is) does not, so the comparison is strictly about creating the list of duplicates.

Furthermore, the speed partly depends on how fast the disk reads: I tested several media (among others a network drive) with huge differences, especially on smaller directories, where creating the file list takes a relatively large part of the total time.

The bottom line is that it won't be a quick job either way; you might ask yourself whether the directories aren't simply too large.

How it works

When the script finds duplicates, they are listed as follows:

Creating file list... /home/jacob/Bureaublad/test2
Creating file list... /home/jacob/Bureaublad/foto
Creating file list... /home/jacob/Bureaublad/Askubuntu
Checking for duplicates (10790 files)...
------------------------------------------------------------ 
>  found duplicate: test1.txt 2 

/home/jacob/Bureaublad/test2/test1.txt
/home/jacob/Bureaublad/test2/another directory/test1.txt
------------------------------------------------------------ 

and so on

The script

#!/usr/bin/env python3

import os
import sys

total_filelist = []
total_names = []

# walk a directory tree; return the bare file names and the full paths
def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

# read up to 10 directories from the command line arguments
i = 1
while i <= 10:
    try:
        dr = sys.argv[i]
        print("Creating file list...", dr)
        names, paths = find_files(dr)      # walk each directory only once
        total_filelist = total_filelist+paths
        total_names = total_names+names
        i = i+1
    except IndexError:
        break

print("Checking for duplicates ("+str(len(total_names)),"files)...")

# a file name that occurs more than once is a duplicate; print all its full paths
for name in set(total_names):
    n = total_names.count(name)
    if n > 1:
        print("-"*60,"\n>  found duplicate:",
              name, n, "\n")
        for item in total_filelist:
            if item.endswith("/"+name):
                print(item)

print("-"*60, "\n")

Copy it into an empty file, save it as find_dupes.py and run it with the command:

python3 <script> <directory1> <directory2> <directory3> 

Up to a maximum of 10 directories.
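
The script compares file names only. To make sure that files with the same name really have identical content, as suggested in the question (first file name, then checksum), you could additionally compare checksums. Below is a minimal sketch of such a check; the helper function checksum() and the example paths are assumptions for illustration, not part of the script above:

#!/usr/bin/env python3
# Minimal sketch: confirm that same-named files have identical content by
# comparing sha256 checksums. checksum() and the example paths are
# hypothetical; the main script above does not contain them.

import hashlib

def checksum(path, blocksize=65536):
    # read the file in blocks so large files don't have to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

# hypothetical paths, e.g. taken from one block of the script's output
paths = ["/home/jacob/Bureaublad/test2/test1.txt",
         "/home/jacob/Bureaublad/test2/another directory/test1.txt"]

if len(set(checksum(p) for p in paths)) == 1:
    print("identical content:", *paths, sep="\n")
else:
    print("same name, different content")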

More options of the script

It is relatively simple to add additional functionality: moving duplicates to another directory, for example, or renaming them, so you can decide manually or automatically which file to keep.
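
As a minimal sketch of such an extra option (assuming a hypothetical target directory dupes_dir and a hypothetical list paths, which would correspond to the full paths printed for one duplicate name), moving all but the first occurrence could look like this:

import os
import shutil

# Minimal sketch: keep the first occurrence of a duplicate, move the rest
# into a separate directory. dupes_dir and paths are assumptions for
# illustration; they are not part of the script above.
dupes_dir = "/home/jacob/duplicates"          # hypothetical target directory
os.makedirs(dupes_dir, exist_ok=True)

paths = ["/home/jacob/Bureaublad/test2/test1.txt",
         "/home/jacob/Bureaublad/test2/another directory/test1.txt"]

for p in paths[1:]:                           # keep the first file, move the rest
    target = os.path.join(dupes_dir, os.path.basename(p))
    if not os.path.exists(target):            # don't overwrite an earlier move
        shutil.move(p, target)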

How to make the job doable

Your directories seem huge. To make the job reasonably feasible, there is another, more sophisticated way to prevent the system from "choking": instead of doing the job on all file types (extensions) at once, you could cut the job into sections per file type. A small test on a directory of 30,000 files reduced the time from approximately 20 seconds (all files) to 0.3 seconds for a single extension.

To make the script look for duplicates of only one file type, replace this section of the script:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

with:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"): # example .py extension
                l.append(file)
                l2.append(root+"/"+file)
    return (l, l2)

Finding occurring file extensions

To list all file extensions occurring in a directory, you can use the script below:

#!/usr/bin/env python3

import sys
import os

# collect the extension of each file, skipping hidden files, files without
# an extension and backup files ending in "~"
l = []
for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        if not f.startswith(".") and "." in f and not f.endswith("~"):
            l.append(f[f.rfind("."):])

# print each extension once
for item in set(l):
    print(item)

Copy it into an empty file, save it as find_extensions.py and run it with the command:

python3 <script> <directory>

Example output:

.txt
.mp3
.odt
.py
.desktop
.sh
.ods
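
If you want to combine the two ideas (listing the extensions and checking for duplicates per file type), the sketch below groups the files of a single directory per extension and reports names that occur more than once within each extension. It is an illustration only, not the find_dupes.py script above, and it takes just one directory as an argument:

#!/usr/bin/env python3
# Minimal sketch: group the files of one directory (sys.argv[1]) per
# extension and report file names occurring more than once within each
# extension. This is an illustration, not the script from the answer.

import os
import sys
from collections import defaultdict

per_extension = defaultdict(list)     # extension -> list of (name, full path)

for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        if not f.startswith(".") and "." in f and not f.endswith("~"):
            ext = f[f.rfind("."):]
            per_extension[ext].append((f, os.path.join(root, f)))

for ext, entries in per_extension.items():
    names = [name for name, path in entries]
    for name in set(names):
        if names.count(name) > 1:
            print("duplicate", ext, "file:", name)
            for n, path in entries:
                if n == name:
                    print("   ", path)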