Ubuntu – How to delete duplicate files with fdupes

When I run fdupes it finds more than 30,000 duplicate files. I need to keep one copy and delete all the other duplicates (because some of them are system files). Please give me a command or script to do this without having to press "1", "2", or "all" for each and every group of duplicate files.
Related Solutions
Here's a test I performed to see which file fdupes lists first in each group of duplicates, comparing the fdupes order against the files' access, change, and modification times:
$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:14 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:02 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample3.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample3.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
$ touch -a 001/sample2.mp3
$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:02 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample3.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample3.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
$ touch -m 001/sample3.mp3
$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:32 001/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:34 001/sample3.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
001/sample3.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
./001/sample3.mp3
$ fdupes -rd ./001/ ./002/
[1] ./001/sample0.mp3
[2] ./001/sample2.mp3
[3] ./002/sample2.mp3
[4] ./001/sample.mp3
[5] ./001/sample3.mp3
Set 1 of 1, preserve files [1 - 5, all]: 4
[-] ./001/sample0.mp3
[-] ./001/sample2.mp3
[-] ./002/sample2.mp3
[+] ./001/sample.mp3
[-] ./001/sample3.mp3
Conclusion:
fdupes lists the files in each set of duplicates in ascending order of modification time, so the first file in each set is the one with the oldest mtime. Note that changing only the access time (and, with it, the ctime) did not affect the fdupes order; only a change to the mtime did.
That means that if you use fdupes -rdN [directory], the file with the oldest mtime in each set of duplicates will be preserved and the rest will be deleted.
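For example, repeating the interactive run above non-interactively (this exact invocation is my illustration and is not part of the original transcript) should keep only ./001/sample0.mp3, the file with the oldest mtime, and silently delete the other four duplicates:
$ fdupes -rdN ./001/ ./002/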
I have another, far more flexible and easy-to-use solution for you!
Copy the script below and paste it into /usr/local/bin/dupe-check
(or any other location and file name; you need root permissions for this).
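One straightforward way to create that file (the choice of editor here is mine; any editor works):
$ sudo nano /usr/local/bin/dupe-check   # paste the script, then save and exit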
Make it executable by running this command:
sudo chmod +x /usr/local/bin/dupe-check
Since /usr/local/bin
is in every user's PATH, anyone can now run it directly without specifying its full path.
First, you should look at the help page of my script:
$ dupe-check --help
usage: dupe-check [-h] [-s COMMAND] [-r MAXDEPTH] [-e | -d] [-0]
                  [-v | -q | -Q] [-g] [-p] [-V]
                  [directory]

Check for duplicate files

positional arguments:
  directory             the directory to examine recursively (default '.')

optional arguments:
  -h, --help            show this help message and exit
  -s COMMAND, --hashsum COMMAND
                        external system command to generate hashes (default
                        'sha256sum')
  -r MAXDEPTH, --recursion-depth MAXDEPTH
                        the number of subdirectory levels to process: 0=only
                        current directory, 1=max. 1st subdirectory level, ...
                        (default: infinite)
  -e, --equal-names     only list duplicates with equal file names
  -d, --different-names
                        only list duplicates with different file names
  -0, --no-zero         do not list 0-byte files
  -v, --verbose         print hash and name of each examined file
  -q, --quiet           suppress status output on stderr
  -Q, --list-only       only list the duplicate files, no summary etc.
  -g, --no-groups       do not group equal duplicates
  -p, --path-only       only print the full path in the results list,
                        otherwise format output like this: `'FILENAME'
                        (FULL_PATH)´
  -V, --version         show program's version number and exit
As you can see, to get a list of all duplicate files in the current directory (and all subdirectories) that have different file names, you need the -d
flag plus any valid combination of the formatting options (demonstrated after the basic example below).
We still assume the same test environment; files with similar names (differing only in the trailing number) have identical content:
.
├── dir1
│   ├── uname1
│   └── uname3
├── grps
├── lsbrelease
├── lsbrelease2
├── uname1
└── uname2
So we simply run:
$ dupe-check
Checked 7 files in total, 6 of them are duplicates by content.
Here's a list of all duplicate files:
'lsbrelease' (./lsbrelease)
'lsbrelease2' (./lsbrelease2)
'uname1' (./dir1/uname1)
'uname1' (./uname1)
'uname2' (./uname2)
'uname3' (./dir1/uname3)
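To restrict that list to duplicates whose file names differ, as described before the example, add the -d flag; combining it with -Q and -p (both documented in the help text above) prints nothing but the bare paths, which is handy for piping into other tools. This particular invocation is only an illustration of the documented options:
$ dupe-check -d -Q -p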
And here is the script:
#! /usr/bin/env python3

VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO = 0, 4, 1
RELEASE_DATE, AUTHOR = "2016-02-11", "ByteCommander"

import sys
import os
import shutil
import subprocess
import argparse


# Prints normal output to stdout while keeping a single, self-overwriting
# status line up to date on stderr.
class Printer:
    def __init__(self, normal=sys.stdout, stat=sys.stderr):
        self.__normal = normal
        self.__stat = stat
        self.__prev_msg = ""
        self.__first = True
        self.__max_width = shutil.get_terminal_size().columns

    def __call__(self, msg, stat=False):
        if not stat:
            if not self.__first:
                print("\r" + " " * len(self.__prev_msg) + "\r",
                      end="", file=self.__stat)
            print(msg, file=self.__normal)
            print(self.__prev_msg, end="", flush=True, file=self.__stat)
        else:
            if len(msg) > self.__max_width:
                msg = msg[:self.__max_width - 3] + "..."
            if not msg:
                print("\r" + " " * len(self.__prev_msg) + "\r",
                      end="", flush=True, file=self.__stat)
            elif self.__first:
                print(msg, end="", flush=True, file=self.__stat)
                self.__first = False
            else:
                print("\r" + " " * len(self.__prev_msg) + "\r",
                      end="", file=self.__stat)
                print("\r" + msg, end="", flush=True, file=self.__stat)
            self.__prev_msg = msg


# Yields (directory, filenames) tuples, recursing at most maxdepth
# subdirectory levels deep (a negative value means unlimited).
def file_walker(top, maxdepth=None):
    dirs, files = [], []
    for name in os.listdir(top):
        (dirs if os.path.isdir(os.path.join(top, name)) else files).append(name)
    yield top, files
    if maxdepth != 0:
        for name in dirs:
            for x in file_walker(os.path.join(top, name), maxdepth - 1):
                yield x


printx = Printer()

argparser = argparse.ArgumentParser(description="Check for duplicate files")
argparser.add_argument("directory", action="store", default=".", nargs="?",
                       help="the directory to examine recursively "
                            "(default '%(default)s')")
argparser.add_argument("-s", "--hashsum", action="store", default="sha256sum",
                       metavar="COMMAND", help="external system command to "
                       "generate hashes (default '%(default)s')")
argparser.add_argument("-r", "--recursion-depth", action="store", type=int,
                       default=-1, metavar="MAXDEPTH",
                       help="the number of subdirectory levels to process: "
                            "0=only current directory, 1=max. 1st subdirectory "
                            "level, ... (default: infinite)")
arggroupn = argparser.add_mutually_exclusive_group()
arggroupn.add_argument("-e", "--equal-names", action="store_const",
                       const="e", dest="name_filter",
                       help="only list duplicates with equal file names")
arggroupn.add_argument("-d", "--different-names", action="store_const",
                       const="d", dest="name_filter",
                       help="only list duplicates with different file names")
argparser.add_argument("-0", "--no-zero", action="store_true", default=False,
                       help="do not list 0-byte files")
arggroupo = argparser.add_mutually_exclusive_group()
arggroupo.add_argument("-v", "--verbose", action="store_const", const=0,
                       dest="output_level",
                       help="print hash and name of each examined file")
arggroupo.add_argument("-q", "--quiet", action="store_const", const=2,
                       dest="output_level",
                       help="suppress status output on stderr")
arggroupo.add_argument("-Q", "--list-only", action="store_const", const=3,
                       dest="output_level",
                       help="only list the duplicate files, no summary etc.")
argparser.add_argument("-g", "--no-groups", action="store_true", default=False,
                       help="do not group equal duplicates")
argparser.add_argument("-p", "--path-only", action="store_true", default=False,
                       help="only print the full path in the results list, "
                            "otherwise format output like this: "
                            "`'FILENAME' (FULL_PATH)´")
argparser.add_argument("-V", "--version", action="version",
                       version="%(prog)s {}.{}.{} ({} by {})".format(
                           VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO,
                           RELEASE_DATE, AUTHOR))
argparser.set_defaults(name_filter="a", output_level=1)
args = argparser.parse_args()

# Hash every file and group the (filename, path) tuples by hash value.
hashes = {}
dupe_counter = 0
file_counter = 0
try:
    for root, filenames in file_walker(args.directory, args.recursion_depth):
        if args.output_level <= 1:
            printx("--> {} files ({} duplicates) processed - '{}'".format(
                file_counter, dupe_counter, root), stat=True)
        for filename in filenames:
            path = os.path.join(root, filename)
            file_counter += 1
            filehash = subprocess.check_output(
                [args.hashsum, path], universal_newlines=True).split()[0]
            if args.output_level == 0:
                printx(" ".join((filehash, path)))
            if filehash in hashes:
                dupe_counter += 1 if len(hashes[filehash]) > 1 else 2
                hashes[filehash].append((filename, path))
                if args.output_level <= 1:
                    printx("--> {} files ({} duplicates) processed - '{}'"
                           .format(file_counter, dupe_counter, root), stat=True)
            else:
                hashes[filehash] = [(filename, path)]
except FileNotFoundError:
    printx("ERROR: Directory not found!")
    exit(1)
except KeyboardInterrupt:
    printx("USER ABORTED SEARCH!")
    printx("Results so far:")

if args.output_level <= 1:
    printx("", stat=True)
if args.output_level == 0:
    printx("")
if args.output_level <= 2:
    printx("Checked {} files in total, {} of them are duplicates by content."
           .format(file_counter, dupe_counter))
if dupe_counter == 0:
    exit(0)
elif args.output_level <= 2:
    printx("Here's a list of all duplicate{} files{}:".format(
        " non-zero-byte" if args.no_zero else "",
        " with different names" if args.name_filter == "d" else
        " with equal names" if args.name_filter == "e" else ""))

# Print the duplicate groups, applying the name filter and formatting options.
first_group = True
for filehash in hashes:
    if len(hashes[filehash]) > 1:
        # use the full path (index 1), not just the file name, for the size check
        if args.no_zero and os.path.getsize(hashes[filehash][0][1]) == 0:
            continue
        first_group = False
        if args.name_filter == "a":
            filtered = hashes[filehash]
        else:
            filenames = {}
            for filename, path in hashes[filehash]:
                if filename in filenames:
                    filenames[filename].append(path)
                else:
                    filenames[filename] = [path]
            filtered = [(filename, path)
                        for filename in filenames if (
                            args.name_filter == "e" and len(filenames[filename]) > 1 or
                            args.name_filter == "d" and len(filenames[filename]) == 1)
                        for path in filenames[filename]]
        if len(filtered) == 0:
            continue
        if (not args.no_groups) and (args.output_level <= 2 or not first_group):
            printx("")
        for filename, path in sorted(filtered):
            if args.path_only:
                printx(path)
            else:
                printx("'{}' ({})".format(filename, path))
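As a closing usage note (this invocation is my own illustration of the documented -s and -r options, and the directory is just a placeholder), you can swap in a different hash command and limit the recursion depth:
$ dupe-check -s md5sum -r 0 ~/Downloads   # use md5sum instead of sha256sum, check only the top-level directory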
Best Answer
You can do this silently with the command below (I've just used it to clear 150 GB of dupes on Rackspace block storage... £kerching!!).
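A minimal sketch of that silent run, assuming the fdupes behaviour established above (the directory name is a placeholder):
$ fdupes -rdN dir/
Here -r recurses into subdirectories, -d deletes the duplicates, and -N (--noprompt) keeps the first file of each set, i.e. the one with the oldest mtime, without asking.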