Script for deduplicating files and folders with a particular suffix

bash, bash-scripting, deduplication, powershell, script

A botched OneDrive restoration has left me with many files and folders with a " (1)" or " (2)" suffix.

I would like a script (Bash is fine, as I have MinGW and Cygwin; PowerShell would also work) that walks every file and folder under a given folder (e.g. "d:\OneDrive" or "/cygdrive/d/OneDrive"). For each file or folder, it should check whether the same subfolder contains one or more files/folders whose name matches the regex "\1\s*\(\d+\)\.\2", where "\1" is the original file/folder name without its extension and "\2" is the original extension. The script should then binary-compare the original with each match (recursively, in the case of folders) and, if they are identical, delete the copy (the one with the longer filename).

While a possible basic structure of the script is clear (two nested for loops, find for finding files matching the regex, diff for comparison etc.) I'm not familiar enough with Bash scripts to comfortably put the pieces together, and there may well be a more efficient structure in any case (which would help given there are around half a million files to go through).

Best Answer

Here is one script that works and is reasonably efficient. It relies on GNU find (for -regextype), rsync and diff. Note that it only works if the suffix was added as exactly one space followed by "(1)", "(2)" and so on, with nothing added between the closing parenthesis and the extension.

#!/usr/bin/bash
IFS=$'\n';
set -f
# Process the longest paths first, so that copies nested inside copied folders are handled before their parents.
for copy in $(find . -regextype posix-egrep -regex "^.*\ \([0-9]+\)\s*(\.[^/.]*)?$" | awk '{print length($0)"\t"$0}' | sort -rnk1 | cut -f2-); do
    orig=$(rev <<< "$copy" | sed -E 's/\)[0-9]+\(\ //' | rev)
    if [ "$orig" != "$copy" ]; then
        if [ -f "$orig" ]; then
            if [ -f "$copy" ]; then
                echo "File pair: $orig $copy"
                if diff -q "$orig" "$copy" &>/dev/null; then
                    echo "Removing file: $copy"
                    rm -f "$copy";
                fi
            fi           
        fi
        if [ -d "$orig" ]; then
            if [ -d "$copy" ]; then
                echo "Folder pair: $orig $copy"
                if rmdir "$copy" &>/dev/null; then
                    # If the "copy" was an empty directory, rmdir has removed it and we are done.
                    echo "Removed empty folder: $copy"
                else
                    # Copy files missing on either side across without overwriting anything, so the folders can only differ where same-named files differ.
                    rsync -aHAv --ignore-existing "$orig/" "$copy" &>/dev/null
                    rsync -aHAv --ignore-existing "$copy/" "$orig" &>/dev/null
                    if diff -qr "$orig" "$copy" &>/dev/null; then
                        echo "Removing folder: $copy"
                        rm -rf "$copy";
                    fi            
                fi
            fi
        fi
    fi
done
unset IFS;
set +f
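
Since the script deletes as it goes, it may be worth previewing which pairs it would consider first. Below is a minimal sketch of that idea, not part of the script above: strip_copy_suffix and preview_pairs are hypothetical helper names, and the sketch assumes GNU find and Bash. strip_copy_suffix drops the last " (N)" from a path, which is what the rev | sed | rev step does, and preview_pairs only prints candidate pairs without comparing or removing anything.

```shell
#!/usr/bin/env bash

# Hypothetical helper: remove the last " (N)" from a path,
# mirroring the rev | sed | rev step in the script above.
strip_copy_suffix() {
    local path=$1
    # The greedy first group makes the match land on the last " (N)".
    local re='^(.*) \([0-9]+\)(.*)$'
    if [[ $path =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
    else
        printf '%s\n' "$path"
    fi
}

# Hypothetical helper: list original/copy pairs under a root folder.
# Nothing is compared or deleted here.
preview_pairs() {
    local root=$1 copy orig
    find "$root" -regextype posix-egrep \
        -regex '^.* \([0-9]+\)\s*(\.[^/.]*)?$' -print0 |
    while IFS= read -r -d '' copy; do
        orig=$(strip_copy_suffix "$copy")
        if [ -e "$orig" ]; then
            printf 'would compare: %s <-> %s\n' "$orig" "$copy"
        fi
    done
}
```

After sourcing the file, something like preview_pairs /cygdrive/d/OneDrive would list the candidates. Reading NUL-delimited names from find also avoids having to change IFS or disable globbing for the preview.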