Find files that have a confirmed duplicate in the same directory, recursively

Say I have the following directory structure:

root
 |-- dirA
     |-- file.jpg
     |-- file-001.jpg <-- dup
     |-- file2.jpg
     |-- file3.jpg
 |-- dirB
     |-- fileA.jpg
     |-- fileA_ios.jpg <-- dup
     |-- fileB.jpg
     |-- fileC.jpg
 |-- dirC
     |-- fileX.jpg
     |-- fileX_ios.jpg <-- dup
     |-- fileX-001.jpg <-- dup
     |-- fileY.jpg
     |-- fileZ.jpg

So given a root folder, how can I find dups that have identical names (differing only by a suffix) recursively?

The name can be any string, not necessarily file.... The suffixes can be 001, 002, 003, and so on. It is safe to assume a 3-digit numeric pattern and the literal _ios (for regex matching).
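
For example, a pattern that strips those suffixes from a basename might look like this (just a sketch, assuming a sed that supports -E):

echo "fileX-001" | sed -E 's/(-[0-9]{3}|_ios)$//'    # prints: fileX
echo "fileA_ios" | sed -E 's/(-[0-9]{3}|_ios)$//'    # prints: fileA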

My Linux-fu is not very good.

Best Answer

It is slightly long, but it is a single command line. It looks at the contents of the files and compares them using a cryptographic hash (md5sum).

find . -type f -exec md5sum {} + | sort | sed 's/  */|/1' | awk -F\| 'BEGIN{first=1}{if($1==lastid){if(first){first=0;print lastid, lastfile}print $1, $2} else first=1; lastid=$1;lastfile=$2}'

As I said, this is a little long...

The find runs md5sum against all files in the current directory tree. Then the output is sorted by the md5 hash. Since filenames may contain whitespace, the sed changes the first field separator (the two spaces md5sum emits after the hash) to a vertical pipe, which is very unlikely to appear in a filename.
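
For example, a line such as this (the well-known md5 of an empty file):

d41d8cd98f00b204e9800998ecf8427e  ./dirA/file.jpg

becomes

d41d8cd98f00b204e9800998ecf8427e|./dirA/file.jpg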

The last awk command tracks three variables: lastid = the md5 hash from the previous entry, lastfile = the filename from the previous entry, and first = whether the current hash is being seen for the first time.
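
The same awk program, spread over multiple lines with comments, is equivalent to something like this:

awk -F'|' '
BEGIN { first = 1 }
{
    if ($1 == lastid) {              # same hash as the previous line
        if (first) {                 # first duplicate in this group:
            first = 0
            print lastid, lastfile   # print the first file of the group too
        }
        print $1, $2                 # print this duplicate
    } else {
        first = 1                    # hash changed: re-arm the flag
    }
    lastid = $1                      # remember this line for the next one
    lastfile = $2
}'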

The output includes the hash so you can see which files are duplicates of each other.

This does not indicate whether files are hard links (same inode, different name); it just compares the contents.
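
If you also want to spot hard links, one approach is to group by inode number instead of by content. A minimal sketch, assuming GNU find (BSD find has no -printf):

# list inode and path, sorted so identical inodes (hard links) sit together
find . -type f -printf '%i %p\n' | sort -n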

Update: a corrected version that matches on just the basename of the file.

find . -type f -print | sed 's!.*/\(.*\)\.[^.]*$!\1|&!' | awk -F\| '{i=indices[$1]++;found[$1,i]=$2}END{for(bname in indices){if(indices[bname]>1){for(i=0;i<indices[bname];i++){print found[bname,i]}}}}'

Here, the find just lists the filenames; the sed takes the basename component of each pathname (minus its extension) and builds a two-field record of basename and full pathname. The awk then builds a table ("found") of the pathnames seen, indexed by basename and item number; the "indices" array tracks how many files with each basename have been seen. The END clause then prints out any duplicate basenames found.
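
To also treat the -001 and _ios suffixes from the question as duplicates of the unsuffixed name, a second sed expression can strip those suffixes from the basename field before the awk sees it. A sketch, assuming GNU sed with -E and exactly the suffix patterns stated above:

find . -type f -print |
sed -E -e 's!.*/(.*)\.[^.]*$!\1|&!' -e 's/(-[0-9]{3}|_ios)[|]/|/' |
awk -F'|' '{i=indices[$1]++;found[$1,i]=$2}END{for(bname in indices){if(indices[bname]>1){for(i=0;i<indices[bname];i++){print found[bname,i]}}}}'

Note that this groups matching basenames anywhere under the root, not only within a single directory.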
