Ubuntu – Find & delete duplicate files on multiple hard disks at once


I have 4 hard disks and want to find out which files on these four hard disks (including subdirectories) are duplicates. They should be checked not only within a single hard disk but against all the others as well.

The hard disks are large (3 TB), so it has to be efficient (for example first file name comparisons, then checksum checks or similar).

Best Answer

The script below looks for duplicate files in up to 10 directories at once, treating the combined directories as one set.

It should be dramatically faster than both fdupes (running fdupes -r) and fslint: on a relatively small, locally stored directory of 40 GB, the script took 5 seconds to create a list of duplicates, while fdupes and fslint took much longer (roughly 90 and 100 seconds). On a larger directory (700 GB, ~350,000 files) on a relatively slow external USB drive, it took 90 minutes. Judging from the progress indication of fdupes (which is nice; the script below doesn't show progress), fdupes would have taken well over 200-250 minutes, but I didn't wait for it to finish.
I should mention that fslint, for example, offers additional functionality which the script (as it is) does not, so the comparison is strictly about creating the list of duplicates.

Furthermore, the speed partly depends on how fast the disk reads: I tested several media (among others a network drive) with huge differences, especially on smaller directories, where creating the file list takes a relatively large part of the total time.

The bottom line is that it won't be a quick job either way; you might ask yourself whether the directories aren't simply too large.

How it works

When the script finds duplicates, they are listed as follows:

Creating file list... /home/jacob/Bureaublad/test2
Creating file list... /home/jacob/Bureaublad/foto
Creating file list... /home/jacob/Bureaublad/Askubuntu
Checking for duplicates (10790 files)...
------------------------------------------------------------ 
>  found duplicate: test1.txt 2 

/home/jacob/Bureaublad/test2/test1.txt
/home/jacob/Bureaublad/test2/another directory/test1.txt
------------------------------------------------------------ 

and so on

The script

#!/usr/bin/env python3

import os
import sys

total_filelist = []
total_names = []

# walk a directory tree; return the bare file names and the full paths
def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

# read up to 10 directories from the command line arguments
i = 1
while i <= 10:
    try:
        dr = sys.argv[i]
        print("Creating file list...", dr)
        names, paths = find_files(dr)      # walk each directory only once
        total_filelist = total_filelist+paths
        total_names = total_names+names
        i = i+1
    except IndexError:
        break

print("Checking for duplicates ("+str(len(total_names)),"files)...")

# a file name that occurs more than once is a duplicate; print all its full paths
for name in set(total_names):
    n = total_names.count(name)
    if n > 1:
        print("-"*60,"\n>  found duplicate:",
              name, n, "\n")
        for item in total_filelist:
            if item.endswith("/"+name):
                print(item)

print("-"*60, "\n")

Copy it into an empty file, save it as find_dupes.py and run it with the command:

python3 <script> <directory1> <directory2> <directory3> 

Up to a maximum of 10 directories.
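
The script compares file names only. To make sure that files with the same name really have identical content, as suggested in the question (first file name, then checksum), you could additionally compare checksums. Below is a minimal sketch of such a check; the helper function checksum() and the example paths are assumptions for illustration, not part of the script above:

#!/usr/bin/env python3
# Minimal sketch: confirm that same-named files have identical content by
# comparing sha256 checksums. checksum() and the example paths are
# hypothetical; the main script above does not contain them.

import hashlib

def checksum(path, blocksize=65536):
    # read the file in blocks so large files don't have to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

# hypothetical paths, e.g. taken from one block of the script's output
paths = ["/home/jacob/Bureaublad/test2/test1.txt",
         "/home/jacob/Bureaublad/test2/another directory/test1.txt"]

if len(set(checksum(p) for p in paths)) == 1:
    print("identical content:", *paths, sep="\n")
else:
    print("same name, different content")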

More options of the script

It is relatively simple to add additional functionality: moving duplicates to another directory, for example, or renaming them, so you can decide manually or automatically which file to keep.
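
As a minimal sketch of such an extra option (assuming a hypothetical target directory dupes_dir and a hypothetical list paths, which would correspond to the full paths printed for one duplicate name), moving all but the first occurrence could look like this:

import os
import shutil

# Minimal sketch: keep the first occurrence of a duplicate, move the rest
# into a separate directory. dupes_dir and paths are assumptions for
# illustration; they are not part of the script above.
dupes_dir = "/home/jacob/duplicates"          # hypothetical target directory
os.makedirs(dupes_dir, exist_ok=True)

paths = ["/home/jacob/Bureaublad/test2/test1.txt",
         "/home/jacob/Bureaublad/test2/another directory/test1.txt"]

for p in paths[1:]:                           # keep the first file, move the rest
    target = os.path.join(dupes_dir, os.path.basename(p))
    if not os.path.exists(target):            # don't overwrite an earlier move
        shutil.move(p, target)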

How to make the job doable

Your directories seem huge. To make the job reasonably feasible, there is another, more sophisticated way to prevent the system from "choking": instead of doing the job on all file types (extensions) at once, you could cut the job into sections per file type. A small test on a directory of 30,000 files reduced the time from approximately 20 seconds (all files) to 0.3 seconds for a single extension.

To make the script look for duplicates of only one file type, replace this section of the script:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

with:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"): # example .py extension
                l.append(file)
                l2.append(root+"/"+file)
    return (l, l2)

Finding occurring file extensions

To list all file extensions occurring in a directory, you can use the script below:

#!/usr/bin/env python3

import sys
import os

# collect the extension of each file, skipping hidden files, files without
# an extension and backup files ending in "~"
l = []
for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        if not f.startswith(".") and "." in f and not f.endswith("~"):
            l.append(f[f.rfind("."):])

# print each extension once
for item in set(l):
    print(item)

Copy it into an empty file, save it as find_extensions.py and run it with the command:

python3 <script> <directory>

Example output:

.txt
.mp3
.odt
.py
.desktop
.sh
.ods
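
If you want to combine the two ideas (listing the extensions and checking for duplicates per file type), the sketch below groups the files of a single directory per extension and reports names that occur more than once within each extension. It is an illustration only, not the find_dupes.py script above, and it takes just one directory as an argument:

#!/usr/bin/env python3
# Minimal sketch: group the files of one directory (sys.argv[1]) per
# extension and report file names occurring more than once within each
# extension. This is an illustration, not the script from the answer.

import os
import sys
from collections import defaultdict

per_extension = defaultdict(list)     # extension -> list of (name, full path)

for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        if not f.startswith(".") and "." in f and not f.endswith("~"):
            ext = f[f.rfind("."):]
            per_extension[ext].append((f, os.path.join(root, f)))

for ext, entries in per_extension.items():
    names = [name for name, path in entries]
    for name in set(names):
        if names.count(name) > 1:
            print("duplicate", ext, "file:", name)
            for n, path in entries:
                if n == name:
                    print("   ", path)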