Script for deduplicating files and folders with a particular suffix

Tags: bash, bash-scripting, deduplication, powershell, script

A botched OneDrive restoration has left me with many files and folders with a " (1)" or " (2)" suffix.

I would like a script (Bash is fine as I have MinGW + Cygwin, or PowerShell) that walks all files and folders within a given folder (e.g. "d:\OneDrive" or "/cygdrive/d/OneDrive") and, for each file or folder, checks whether there are one or more files/folders in the same subfolder whose name matches the regex "\1\s*\(\d+\)\.\2", where "\1" is the original file/folder name without extension and "\2" is the original extension. The script should then binary compare the original to each of the matches (recursively in the case of folders) and, if they are identical, delete the copy (the one with the longer filename).

While a possible basic structure of the script is clear (two nested for loops, find for finding files matching the regex, diff for comparison etc.) I'm not familiar enough with Bash scripts to comfortably put the pieces together, and there may well be a more efficient structure in any case (which would help given there are around half a million files to go through).

Best Answer

Here is one script that works and is reasonably efficient. Note that it only works if the copy suffix is exactly one space followed by the number in parentheses, e.g. " (1)", with nothing added between the closing parenthesis and the extension.

#!/usr/bin/bash
# Split the output of find on newlines only and disable globbing,
# so that paths containing spaces or wildcard characters stay intact.
IFS=$'\n'
set -f
# Go deepest first (longest paths first, so children come before their parents)
# to deal with copies within copied folders.
for copy in $(find . -regextype posix-egrep -regex "^.*\ \([0-9]+\)\s*(\.[^/.]*)?$" | awk '{print length($0)"\t"$0}' | sort -rnk1 | cut -f2-); do
    # Strip the last " (N)" from the path to get the name of the presumed original.
    orig=$(rev <<< "$copy" | sed -E 's/\)[0-9]+\(\ //' | rev)
    if [ "$orig" != "$copy" ]; then
        if [ -f "$orig" ] && [ -f "$copy" ]; then
            echo "File pair: $orig $copy"
            # Delete the copy only if its contents match the original.
            if diff -q "$orig" "$copy" &>/dev/null; then
                echo "Removing file: $copy"
                rm -f "$copy"
            fi
        fi
        if [ -d "$orig" ] && [ -d "$copy" ]; then
            echo "Folder pair: $orig $copy"
            if rmdir "$copy" &>/dev/null; then
                # If the "copy" was an empty directory then we've removed it and so we're done.
                echo "Removed empty folder: $copy"
            else
                # Non-destructively ensure that both folders have the same files at least.
                rsync -aHAv --ignore-existing "$orig/" "$copy" &>/dev/null
                rsync -aHAv --ignore-existing "$copy/" "$orig" &>/dev/null
                # Delete the copied folder only if the two trees are now identical.
                if diff -qr "$orig" "$copy" &>/dev/null; then
                    echo "Removing folder: $copy"
                    rm -rf "$copy"
                fi
            fi
        fi
    fi
done
unset IFS
set +f
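
Since the script deletes files, it may help to try its two core steps in isolation first. The following is only a sketch, not part of the script above: strip_copy_suffix and remove_if_identical are hypothetical helper names, and DRY_RUN is an assumed variable that defaults to reporting instead of deleting. The first helper derives the presumed original name with a Bash regex rather than rev and sed, under the same assumption of exactly one space before "(N)" and nothing after it. The second compares regular files with cmp -s before removing anything.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the presumed original path for a copy such as
# "./docs/report (1).docx". Returns non-zero if the path has no " (N)" suffix.
strip_copy_suffix() {
    # Exactly one space before "(N)", nothing between ")" and the extension.
    local re='^(.*) \([0-9]+\)(\.[^/.]*)?$'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
}

# Hypothetical helper: remove the copy only if it is byte-identical to the
# original. DRY_RUN is an assumed variable; it defaults to 1 (report only).
remove_if_identical() {
    local orig=$1 copy=$2
    # Only regular files are handled here; folders are out of scope.
    [[ -f $orig && -f $copy ]] || return 1
    # cmp -s prints nothing and exits 0 only when the contents match.
    cmp -s -- "$orig" "$copy" || return 1
    if [[ ${DRY_RUN:-1} == 1 ]]; then
        printf 'Would remove: %s\n' "$copy"
    else
        rm -f -- "$copy"
    fi
}
```

For a given path in $copy, the two can be chained as orig=$(strip_copy_suffix "$copy") && remove_if_identical "$orig" "$copy". Running with the default DRY_RUN=1 only prints what would be removed; setting DRY_RUN=0 performs the deletion.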

