Linux – How to Find Duplicate Files Based on Filename Characters

Tags: bash, duplicate-files, linux, search

I am looking for a way, in a Linux shell (preferably bash), to find duplicate files based on the first few letters of their filenames.

Where this would be useful:

I build mod packs for Minecraft. As of 1.14.4, Forge no longer errors out if a pack contains duplicate mods in different versions; it simply stops the older versions from running. A script to help find these duplicates would be very advantageous.

Example listing:

minecolonies-0.13.312-beta-universal.jar   
minecolonies-0.13.386-alpha-universal.jar 

By quickly identifying the dupes I can keep the client pack small.

More information as requested

There is no specific format. However, as you can see, there are at least two prevailing formats. Further, there is no standard in the community about what kind of characters to use or not use. Some use spaces (ick), some use [] (also ick), some use underscores (more ick), some use dashes (preferred, but what can you do).

https://gist.github.com/be3cc9a77150194476b2000cb8ee16e5 has a sample list of mod filenames. It has been cleaned, so there are no dupes in it.

https://gist.github.com/b0ac1e03145e893e880da45cf08ebd7a contains a sample where I deliberately made duplicates. It is an over-exaggeration of what happens from time to time.

Deeper Explanation

I realize this might be resource-heavy to do.

I would like to arbitrarily specify a slice range (start to finish) of all filenames to sample, find duplicates based on that slice, and then highlight the duplicates. I don't need the script to actually delete them.
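
A minimal sketch of the kind of check I mean, assuming the jars are in the current directory (the slice positions 1 and 12 are arbitrary placeholders):

#!/bin/bash
# Sketch: group filenames by an arbitrary character slice and print
# only the groups that contain more than one file.
START=1   # first character of the slice (1-based)
END=12    # last character of the slice

printf '%s\n' *.jar | awk -v s="$START" -v e="$END" '
    {
        key = tolower(substr($0, s, e - s + 1))   # the slice we compare on
        count[key]++
        names[key] = names[key] $0 "\n"
    }
    END {
        for (key in count)
            if (count[key] > 1)
                printf "=== %s ===\n%s", key, names[key]
    }
'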

Extra Credit

The script would present a menu of the files it suspects match the duplication criterion, allowing for easy deleting or renaming.
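
A rough sketch of that idea, using bash's select builtin (the file list here is only a stand-in for however the suspects were gathered, and the menu is not refreshed after a deletion):

#!/bin/bash
# Sketch: offer each suspected duplicate for deletion in a simple menu.
suspects=( *.jar )   # placeholder: substitute the real suspect list

select file in "${suspects[@]}" quit; do
    [ "$file" = quit ] && break
    [ -n "$file" ] || continue           # ignore invalid menu choices
    read -r -p "Delete $file? [y/N] " answer
    [ "$answer" = y ] && rm -- "$file"
done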

Best Answer

Filter possible duplicates

You could use a script to filter these files for possible duplicates: move into a new directory all files that match at least one other file, case-insensitively, on the part before the first dash, underscore, or space in their names. cd into your jars directory to run it.

#!/bin/bash
mkdir -p possible_dups

# The .jar list is fed to awk twice: the first pass counts each
# lowercased first field (the part before a dash, underscore or space),
# the second pass prints the names whose field was seen more than once.
# xargs then moves those names into possible_dups/.
awk -F'[-_ ]' '
    NR==FNR {seen[tolower($1)]++; next}
    seen[tolower($1)] > 1
' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar) |\
    xargs -r -d'\n' mv -t possible_dups/ --

Note: -r is a GNU extension that avoids running mv with no file arguments when no possible duplicates are found. The GNU option -d'\n' separates filenames on newlines, which means spaces and other usual characters in filenames are handled by the above command, but embedded newlines are not.
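
If you prefer a dry run first, drop the xargs stage, so the command only lists the candidate files instead of moving them:

awk -F'[-_ ]' '
    NR==FNR {seen[tolower($1)]++; next}
    seen[tolower($1)] > 1
' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar)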

You can edit the field separator expression, -F'[-_ ]', to add or remove the characters that mark the end of the part tested for duplication. Currently it means "dash or underscore or space". It is generally good to catch more than the real duplication cases, as I probably do here.
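
To check what the separator actually captures, you can test it on a single name from your listing:

$ printf '%s\n' 'minecolonies-0.13.312-beta-universal.jar' | awk -F'[-_ ]' '{print $1}'
minecolonies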

Now you can inspect these files. You could also proceed directly to the next step, on all files without filtering, if their number is not very large.


Visual inspection of possible duplicates

I suggest using a visual shell for this task, like mc, the Midnight Commander. You can easily install mc with your Linux distribution's package management tool.
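
For example, depending on your distribution's package manager:

sudo apt install mc      # Debian, Ubuntu
sudo dnf install mc      # Fedora
sudo pacman -S mc        # Arch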

Invoke mc in the directory that holds these files, or navigate there from within it. In an X terminal you also get mouse support, but there are handy keyboard shortcuts for everything.

For example, following the menu Left -> Sorting... and unticking "case sensitive" will give you the sorted view you want.

Navigate over the files using the arrows; you can select several of them with Insert, and then copy (F5), move (F6), or delete (F8) the highlighted selection. Here is a screenshot of how it looks on your filtered test data:

[screenshot: mc, sorted case-insensitively, showing the filtered test data]
