Find and list duplicate directories

directory duplicate macos scripting

I have a directory that has a number of sub-directories and would like to find any duplicates. The folder structure looks something like this:

└── Top_Dir
    ├── Level_1_Dir
    │   ├── standard_cat
    │   │   └── files.txt
    │   ├── standard_dog
    │   │   └── files.txt
    │   └── standard_snake
    │       └── files.txt
    ├── Level_2_Dir
    │   ├── standard_moon
    │   │   └── files.txt
    │   ├── standard_sun
    │   │   └── files.txt
    │   └── standard_cat
    │       └── files.txt
    └── Level_3_Dir
        ├── standard_man
        │   └── files.txt
        ├── standard_woman
        │   └── files.txt
        └── standard_moon
            └── files.txt

With the above example I would like to see an output of:

/Top_Dir/Level_1_Dir/standard_cat
/Top_Dir/Level_2_Dir/standard_cat
/Top_Dir/Level_2_Dir/standard_moon
/Top_Dir/Level_3_Dir/standard_moon

I have searched for a way to do this in bash and found nothing. Does anyone know a way to do this?

Best Answer

I had the same problem with my music collection. Most tools and scripts were either noisy (listing every file name) or checksummed the file contents, which is far too slow.

Special characters, spaces, and symbols made this challenging. The strategy is to compute an MD5 sum over the sorted file names below each directory together with that directory's own name; the script can then sort the hashes to find duplicates. The children's file names must be sorted, because find does not guarantee the same file order in two different directories.
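
As a minimal sketch of that strategy for a single directory: the helper name dir_signature is hypothetical (it does not come from the script in this answer), and it assumes GNU md5sum is on the PATH. It hashes the sorted file list plus the directory's own name, so two copies of the same album get the same signature wherever they live.

```shell
# dir_signature DIR: print an MD5 over DIR's sorted file list and its own name.
# Hypothetical helper for illustration; assumes GNU md5sum is installed.
dir_signature() (
    # The function body is a subshell, so this cd does not affect the caller
    cd "$1" || exit 1
    # Sorting makes the listing independent of the order find returns entries
    { find . -type f | LC_ALL=C sort; basename "$PWD"; } | md5sum | cut -d ' ' -f 1
)
```

Calling dir_signature "/mnt/tunes/soul brothers/Motherlode" would print one 32-character hex string; two directories are duplicate candidates when their strings are equal.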

Bash Script (Debian 10):

#!/bin/bash

# usage: ./find_duplicates tunes_dir
# output: c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode
# The MD5 is generated from all children file names + the album folder name.
# The list is sorted by MD5, then lines with a duplicated hash (32 hex
# characters) are printed; each line represents an album.
# Album/CD1/... Album/CD2/... will show (3) results if Album is duplicated
# CD1/2 example is indistinguishable from Discography/Album/Song.mp3

if [ $# -eq 0 ]; then
    echo "Please supply tunes directory as first arg" >&2
    exit 1
fi

# Using absolute path of tunes_dir param (quoted, so spaces in it are safe)
find "$(readlink -f "$1")" -type d | while IFS= read -r line
do
    # Skip directories that cannot be entered instead of hashing the wrong one
    cd "$line" || continue
    children=$(find ./ -type f | sort)
    base=$(basename "$line")
    # Quoted, so names with glob characters or repeated spaces stay intact
    sum=$(echo "$children $base" | md5sum | cut -d ' ' -f 1)
    echo "$sum - $line"
done | sort | uniq -D -w 32
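
The final uniq -D -w 32 step relies on options that POSIX does not define for uniq, and the question is tagged macos, where GNU coreutils may not be installed. As a hedged sketch, the same grouping can be done with sort and awk; the function name print_dupes is hypothetical, and it assumes every input line starts with a 32-character hash.

```shell
# print_dupes: read "HASH - PATH" lines on stdin and print only the lines
# whose first 32 characters (the hash) occur more than once.
# Hypothetical helper for illustration; uses only POSIX sort and awk.
print_dupes() {
    LC_ALL=C sort | awk '
        {
            hash = substr($0, 1, 32)
            if (NR > 1 && hash == prev) {
                # First repeat of this hash: emit the held-back first line too
                if (held != "") { print held; held = "" }
                print
            } else {
                # New hash: hold the line until a second one confirms a duplicate
                held = $0
            }
            prev = hash
        }'
}
```

It would take the place of the last two pipeline stages, as in: done | print_dupes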

Directory structure:

user@pc:~/test# find . -type d
./super soul brothers
./super soul brothers/Stritch's Brew
./super soul brothers/Fireball!
./super soul brothers/Motherlode
./car_tunes
./car_tunes/Fireball!

Example output:

user@pc:~# ./find_duplicates  test/
07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!
07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!
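
For the layout in the question, where two directories seem to count as duplicates as soon as they share a name, hashing the contents may be more than is needed. A minimal sketch under that assumption follows; the function name list_same_name_dirs is hypothetical, and paths containing newlines are not handled.

```shell
# list_same_name_dirs TOP: print every directory under TOP whose base name
# is used by more than one directory, sorted by path.
# Hypothetical helper for illustration; compares names only, not contents.
list_same_name_dirs() {
    find "$1" -type d | awk -F/ '
        # $NF is the last path component, i.e. the directory base name
        { paths[NR] = $0; names[NR] = $NF; count[$NF]++ }
        END {
            for (i = 1; i <= NR; i++)
                if (count[names[i]] > 1) print paths[i]
        }' | LC_ALL=C sort
}
```

Run against the tree from the question, this would list both standard_cat and both standard_moon directories, while names that occur once, such as standard_dog, are left out.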