Linux – How to Find Duplicate Files Based on Filename Characters

Tags: bash, duplicate-files, linux, search

I am looking for a way, in a Linux shell (preferably bash), to find duplicate files based on the first few letters of their filenames.

Where this would be useful:

I build mod packs for Minecraft. As of 1.14.4, Forge no longer errors out if a pack contains duplicate mods in different versions; it simply stops the older versions from running. A script to help find these duplicates would be very advantageous.

Example listing:

minecolonies-0.13.312-beta-universal.jar   
minecolonies-0.13.386-alpha-universal.jar 

By quickly identifying the dupes I can keep the client pack small.

More information as requested

There is no specific format. However, as you can see, there are at least two prevailing formats. Further, there is no standard in the community about what kind of characters to use or not use. Some use spaces (ick), some use [] (also ick), some use underscores (more ick), some use dashes (preferred, but what can you do).

https://gist.github.com/be3cc9a77150194476b2000cb8ee16e5 has a sample list of mod filenames. It has been cleaned, so there are no dupes in it.

https://gist.github.com/b0ac1e03145e893e880da45cf08ebd7a contains a sample where I deliberately made duplicates. It is an over-exaggeration of what happens from time to time.

Deeper Explanation

I realize this might be resource-heavy to do.

I would like to arbitrarily specify a slice range (start to finish) of all filenames to sample, find duplicates based on that slice, and then highlight the duplicates. I don't need the script to actually delete them.
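
A minimal sketch of the kind of check I mean, assuming the jars are in the current directory (the slice positions 1 and 12 are arbitrary placeholders):

#!/bin/bash
# Sketch: group filenames by an arbitrary character slice and print
# only the groups that contain more than one file.
START=1   # first character of the slice (1-based)
END=12    # last character of the slice

printf '%s\n' *.jar | awk -v s="$START" -v e="$END" '
    {
        key = tolower(substr($0, s, e - s + 1))   # the slice we compare on
        count[key]++
        names[key] = names[key] $0 "\n"
    }
    END {
        for (key in count)
            if (count[key] > 1)
                printf "=== %s ===\n%s", key, names[key]
    }
'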

Extra Credit

The script would present a menu of the files it suspects match the duplication criterion, allowing for easy deleting or renaming.
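
A rough sketch of that idea, using bash's select builtin (the file list here is only a stand-in for however the suspects were gathered, and the menu is not refreshed after a deletion):

#!/bin/bash
# Sketch: offer each suspected duplicate for deletion in a simple menu.
suspects=( *.jar )   # placeholder: substitute the real suspect list

select file in "${suspects[@]}" quit; do
    [ "$file" = quit ] && break
    [ -n "$file" ] || continue           # ignore invalid menu choices
    read -r -p "Delete $file? [y/N] " answer
    [ "$answer" = y ] && rm -- "$file"
done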

Best Answer

Filter possible duplicates

You could use a script to filter these files for possible duplicates: move into a new directory all files that match at least one other file, case-insensitively, on the part before the first dash, underscore, or space in their names. cd into your jars directory to run it.

#!/bin/bash
mkdir -p possible_dups

# The .jar list is fed to awk twice: the first pass counts each
# lowercased first field (the part before a dash, underscore or space),
# the second pass prints the names whose field was seen more than once.
# xargs then moves those names into possible_dups/.
awk -F'[-_ ]' '
    NR==FNR {seen[tolower($1)]++; next}
    seen[tolower($1)] > 1
' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar) |\
    xargs -r -d'\n' mv -t possible_dups/ --

Note: -r is a GNU extension that avoids running mv with no file arguments when no possible duplicates are found. The GNU option -d'\n' separates filenames on newlines, which means spaces and other usual characters in filenames are handled by the above command, but embedded newlines are not.
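
If you prefer a dry run first, drop the xargs stage, so the command only lists the candidate files instead of moving them:

awk -F'[-_ ]' '
    NR==FNR {seen[tolower($1)]++; next}
    seen[tolower($1)] > 1
' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar)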

You can edit the field separator expression, -F'[-_ ]', to add or remove the characters that mark the end of the part tested for duplication. Currently it means "dash or underscore or space". It is generally good to catch more than the real duplication cases, as I probably do here.
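
To check what the separator actually captures, you can test it on a single name from your listing:

$ printf '%s\n' 'minecolonies-0.13.312-beta-universal.jar' | awk -F'[-_ ]' '{print $1}'
minecolonies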

Now you can inspect these files. You could also proceed directly to the next step, on all files without filtering, if their number is not very large.


Visual inspection of possible duplicates

I suggest using a visual shell for this task, like mc, the Midnight Commander. You can easily install mc with your Linux distribution's package management tool.
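
For example, depending on your distribution's package manager:

sudo apt install mc      # Debian, Ubuntu
sudo dnf install mc      # Fedora
sudo pacman -S mc        # Arch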

Invoke mc in the directory that holds these files, or navigate there from within it. In an X terminal you also get mouse support, but there are handy keyboard shortcuts for everything.

For example, following the menu Left -> Sorting... and unticking "case sensitive" will give you the sorted view you want.

Navigate over the files using the arrows; you can select several of them with Insert, and then copy (F5), move (F6), or delete (F8) the highlighted selection. Here is a screenshot of how it looks on your filtered test data:

[screenshot: mc, sorted case-insensitively, showing the filtered test data]
