Find files that have a confirmed duplicate in the same directory, recursively

Say I have the following directory structure:

root
 |-- dirA
     |-- file.jpg
     |-- file-001.jpg <-- dup
     |-- file2.jpg
     |-- file3.jpg
 |-- dirB
     |-- fileA.jpg
     |-- fileA_ios.jpg <-- dup
     |-- fileB.jpg
     |-- fileC.jpg
 |-- dirC
     |-- fileX.jpg
     |-- fileX_ios.jpg <-- dup
     |-- fileX-001.jpg <-- dup
     |-- fileY.jpg
     |-- fileZ.jpg

So given a root folder, how can I find dups that have identical names (differing only by a suffix) recursively?

The name can be any string, not necessarily file.... The suffixes can be 001, 002, 003, and so on. It is safe to assume a 3-digit numeric pattern and the literal _ios (for regex matching).
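
For example, a pattern that strips those suffixes from a basename might look like this (just a sketch, assuming a sed that supports -E):

echo "fileX-001" | sed -E 's/(-[0-9]{3}|_ios)$//'    # prints: fileX
echo "fileA_ios" | sed -E 's/(-[0-9]{3}|_ios)$//'    # prints: fileA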

My Linux-fu is not very good.

Best Answer

It is slightly long, but it is a single command line. It looks at the contents of the files and compares them using a cryptographic hash (md5sum).

find . -type f -exec md5sum {} + | sort | sed 's/  */|/1' | awk -F\| 'BEGIN{first=1}{if($1==lastid){if(first){first=0;print lastid, lastfile}print $1, $2} else first=1; lastid=$1;lastfile=$2}'

As I said, this is a little long...

The find runs md5sum against all files in the current directory tree. Then the output is sorted by the md5 hash. Since filenames may contain whitespace, the sed changes the first field separator (the two spaces md5sum emits after the hash) to a vertical pipe, which is very unlikely to appear in a filename.
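
For example, a line such as this (the well-known md5 of an empty file):

d41d8cd98f00b204e9800998ecf8427e  ./dirA/file.jpg

becomes

d41d8cd98f00b204e9800998ecf8427e|./dirA/file.jpg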

The last awk command tracks three variables: lastid = the md5 hash from the previous entry, lastfile = the filename from the previous entry, and first = whether the current hash is being seen for the first time.
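
The same awk program, spread over multiple lines with comments, is equivalent to something like this:

awk -F'|' '
BEGIN { first = 1 }
{
    if ($1 == lastid) {              # same hash as the previous line
        if (first) {                 # first duplicate in this group:
            first = 0
            print lastid, lastfile   # print the first file of the group too
        }
        print $1, $2                 # print this duplicate
    } else {
        first = 1                    # hash changed: re-arm the flag
    }
    lastid = $1                      # remember this line for the next one
    lastfile = $2
}'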

The output includes the hash so you can see which files are duplicates of each other.

This does not indicate whether files are hard links (same inode, different name); it just compares the contents.
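
If you also want to spot hard links, one approach is to group by inode number instead of by content. A minimal sketch, assuming GNU find (BSD find has no -printf):

# list inode and path, sorted so identical inodes (hard links) sit together
find . -type f -printf '%i %p\n' | sort -n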

Update: a corrected version that matches on just the basename of the file.

find . -type f -print | sed 's!.*/\(.*\)\.[^.]*$!\1|&!' | awk -F\| '{i=indices[$1]++;found[$1,i]=$2}END{for(bname in indices){if(indices[bname]>1){for(i=0;i<indices[bname];i++){print found[bname,i]}}}}'

Here, the find just lists the filenames; the sed takes the basename component of each pathname (minus its extension) and builds a two-field record of basename and full pathname. The awk then builds a table ("found") of the pathnames seen, indexed by basename and item number; the "indices" array tracks how many files with each basename have been seen. The END clause then prints out any duplicate basenames found.
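
To also treat the -001 and _ios suffixes from the question as duplicates of the unsuffixed name, a second sed expression can strip those suffixes from the basename field before the awk sees it. A sketch, assuming GNU sed with -E and exactly the suffix patterns stated above:

find . -type f -print |
sed -E -e 's!.*/(.*)\.[^.]*$!\1|&!' -e 's/(-[0-9]{3}|_ios)[|]/|/' |
awk -F'|' '{i=indices[$1]++;found[$1,i]=$2}END{for(bname in indices){if(indices[bname]>1){for(i=0;i<indices[bname];i++){print found[bname,i]}}}}'

Note that this groups matching basenames anywhere under the root, not only within a single directory.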
