Ubuntu – How does fdupes determine which file to keep and which to delete from set of duplicate files scattered in storage disk

command lineduplicate files

I just downloaded fdupes and was giving it a try. I am curious to know how the software goes about to determine which file it will put first when multiple files are found. I am running:

Distributor ID: Ubuntu
Description:    Ubuntu 12.04.3 LTS
Release:        12.04
Codename:       precise

Here is the command that I ran.

fdupes -Nrd /backup/local/fileserver_backup/home

in that "home" directory there are two directories with identical content (I used cp -r ./sam ./sam1):

sam/…

sam1/…

With the command above, I found that all the files were left in sam. But when I tried to run the same command with the following directory structure:

sa/…

sam/…

I found that all the files were still left in sam, not sa as I expected.

Now my Questions are:

  • Does fdupes always keep the oldest file?
  • How does it sort the files when finding the first and all subsequent duplicates?
  • Is this OS dependent?
  • Is this something that the user can control?

I have some 300000 lines of duplicate files. Being able to provide the software some guidance like, "always keep files in this directory when given the choice, skip if not available" or something like that would be great addition.

Best Answer

Here's a test I performed:

$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan  9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:14 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:02 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample3.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample3.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
$ touch -a 001/sample2.mp3 
$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan  9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:02 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample3.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample3.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
$ touch -m 001/sample3.mp3 
$ ls -lt -u -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 001/sample3.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 11:49 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:32 001/sample2.mp3
$ ls -lt -c -r */*.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan  9 23:39 001/sample0.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 00:20 002/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 10 01:08 001/sample.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:29 001/sample2.mp3
-rwxrwxr-x 1 hash hash 3416208 Jan 11 22:34 001/sample3.mp3
$ ls -t -1r */*.mp3
001/sample0.mp3
001/sample2.mp3
002/sample2.mp3
001/sample.mp3
001/sample3.mp3
$ fdupes -r . | grep mp3
./001/sample0.mp3
./001/sample2.mp3
./002/sample2.mp3
./001/sample.mp3
./001/sample3.mp3
$ fdupes -rd ./001/ ./002/
[1] ./001/sample0.mp3                 
[2] ./001/sample2.mp3
[3] ./002/sample2.mp3
[4] ./001/sample.mp3
[5] ./001/sample3.mp3

Set 1 of 1, preserve files [1 - 5, all]: 4

   [-] ./001/sample0.mp3
   [-] ./001/sample2.mp3
   [-] ./002/sample2.mp3
   [+] ./001/sample.mp3
   [-] ./001/sample3.mp3

Conclusion:

The duplicate files are sorted in reverse order of latest modification time. So, the first file in the set of duplicates is the oldest in term of modification time (mtime).

That means if you use fdupes -rdN [directory] ..., the file with oldest mtime in each set of duplicates will be preserved and the rest will be deleted.

References:

Related Question