Find and list duplicate directories

directory duplicate macos scripting

I have a directory that has a number of sub-directories, and I would like to find any duplicates. The folder structure looks something like this:

└── Top_Dir
    ├── Level_1_Dir
    │   ├── standard_cat
    │   │   └── files.txt
    │   ├── standard_dog
    │   │   └── files.txt
    │   └── standard_snake
    │       └── files.txt
    ├── Level_2_Dir
    │   ├── standard_moon
    │   │   └── files.txt
    │   ├── standard_sun
    │   │   └── files.txt
    │   └── standard_cat
    │       └── files.txt
    └── Level_3_Dir
        ├── standard_man
        │   └── files.txt
        ├── standard_woman
        │   └── files.txt
        └── standard_moon
            └── files.txt

With the above example I would like to see an output of:

/Top_Dir/Level_1_Dir/standard_cat
/Top_Dir/Level_2_Dir/standard_cat
/Top_Dir/Level_2_Dir/standard_moon
/Top_Dir/Level_3_Dir/standard_moon

I have searched for a way to do this in bash and found nothing. Does anyone know how to do this?

Best Answer

I had the same problem with my music collection... most tools/scripts were noisy (listing filenames) or did checksums of file contents, which is far too slow...

Special characters, spaces, and symbols made this challenging... the strategy is to compute an MD5 hash of each directory's sorted file names together with the directory's own name; the script can then sort by hash, so that duplicates end up on adjacent lines. The child file names must be sorted first, because find does not guarantee the same file order in two different directories.

Bash Script (Debian 10):

#!/bin/bash

# usage: ./find_duplicates tunes_dir
# output: c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode
# MD5 is generated from all children filenames + album folder name
# sort list by MD5, then list the lines whose hash (32 hex characters) repeats
# Album/CD1/... Album/CD2/... will show (3) results if Album is duplicated
# CD1/2 example is indistinguishable from Discography/Album/Song.mp3
# uses md5sum and uniq -D -w as provided by GNU coreutils

if [ $# -eq 0 ]; then
    echo "Please supply tunes directory as first arg" >&2
    exit 1
fi

# Using absolute path of tunes_dir param
find "$(readlink -f "$1")" -type d | while IFS= read -r line
do
    # skip directories we cannot enter
    cd "$line" || continue
    # sorted child file names, joined by single spaces
    children=$(find ./ -type f | sort | tr '\n' ' ')
    base=$(basename "$line")
    sum=$(printf '%s%s\n' "$children" "$base" | md5sum)
    # md5sum prints "<hash>  -"; keep only the hash
    echo "${sum%% *} - $line"
done | sort | uniq -D -w 32

Directory structure:

user@pc:~/test# find . -type d
./super soul brothers
./super soul brothers/Stritch's Brew
./super soul brothers/Fireball!
./super soul brothers/Motherlode
./car_tunes
./car_tunes/Fireball!

Example output:

user@pc:~# ./find_duplicates  test/
07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!
07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!
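
If only the directory names matter, as in the question's expected output, hashing is not needed. Here is a minimal sketch of that simpler case, assuming no directory name contains a newline. `dup_dirs_by_name` is a made-up helper name, not part of the script above, and it compares names only, never contents:

```shell
# Hypothetical helper: print every directory below "$1" whose name
# (last path component) occurs more than once anywhere in the tree.
# Assumption: directory names contain no newline characters.
dup_dirs_by_name() {
    find "$1" -mindepth 1 -type d | awk -F/ '
        # remember each path and count how often its last component appears
        { count[$NF]++; path[NR] = $0; name[NR] = $NF }
        END {
            for (i = 1; i <= NR; i++)
                if (count[name[i]] > 1) print path[i]
        }' | LC_ALL=C sort
}
```

Calling `dup_dirs_by_name /Top_Dir` on the tree from the question should list the two standard_cat and the two standard_moon directories, sorted by path.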

Related Solutions

Bash – An easy way to change directories from within one duplicate folder structure to another

Use string substitution in bash:

$ a="~/dev/com/wp-content/plugins/myplugin"
$ echo ${a/dev/staging}
~/staging/com/wp-content/plugins/myplugin

So a function like:

cdx () 
{
    cd "${PWD/$1/$2}"
}

And then do cdx dev staging to switch from a folder in dev to staging. With some checks, you could name the function cd:

cd ()
{
    if [ $# != 2 ]
    then
        builtin cd "$@"
    else
        builtin cd "${PWD/$1/$2}"
    fi
}

Effect:

~ # cd /tmp
/tmp # cd tmp srv
/srv # cd
~ # cd -
/srv
/srv # cd tmp var
/srv #

This retains the usual behaviour of cd in all cases, except for two arguments.
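
Note that in the last step of the transcript (cd tmp var inside /srv) nothing matches, so the shell silently stays where it is. As a sketch of one possible extra check, and not something the answer itself proposes, this cdx variant reports a failed substitution instead. Quoting "$1" inside the expansion makes bash treat it as literal text rather than a glob pattern:

```shell
# Hypothetical variant of cdx: fail loudly when the substitution changes nothing.
cdx() {
    # replace the first occurrence of the literal string $1 in $PWD with $2
    local target=${PWD/"$1"/$2}
    if [ "$target" = "$PWD" ]; then
        echo "cdx: '$1' not found in $PWD" >&2
        return 1
    fi
    cd "$target"
}
```

As with the original, only the first occurrence of the first argument in the path is replaced.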

How to delete all duplicate hardlinks to a file

In the end it wasn't too hard to do this manually, based on Stéphane's and xenoid's hints and some prior experience with find.
I had to adapt a few commands to work with FreeBSD's non-GNU tools — GNU find has the -printf option that could have replaced the -exec stat, but FreeBSD's find doesn't have that.

# create a list of "<inode number> <tab> <full file path>"
find rsnapshots -type f -links +1 -exec stat -f '%i%t%R' {} + > inodes.txt

# sort the list by inode number (to have consecutive blocks of duplicate files)
sort -n inodes.txt > inodes.sorted.txt

# remove the first file from each block (we want to keep one link per inode)
awk -F'\t' 'BEGIN {lastinode = 0} {inode = 0+$1; if (inode == lastinode) {print $2}; lastinode = inode}' inodes.sorted.txt > inodes.to-delete.txt

# delete duplicates (IFS= and -r keep leading whitespace and backslashes in file names intact; file names containing newlines are still not handled)
while IFS= read -r line; do rm -f -- "$line"; done < inodes.to-delete.txt
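
Before deleting anything, it may help to try the awk filter on a few made-up lines. The sketch below wraps the same logic in a function; `paths_to_delete` is a hypothetical name, and the input is assumed to be already sorted by inode number:

```shell
# Hypothetical helper: read "<inode><TAB><path>" lines (sorted by inode)
# on stdin and print every path except the first one seen for each inode.
paths_to_delete() {
    awk -F'\t' '{ inode = 0 + $1; if (inode == last) print $2; last = inode }'
}
```

For example, `printf '12\t/a/x\n12\t/b/x\n15\t/a/y\n' | paths_to_delete` prints only /b/x: the first link for inode 12 is kept, and inode 15 has just one entry in the list.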
