Shell – How to find and delete duplicate files within the same directory

I want to find duplicate files, within a directory, and then delete all but one, to reclaim space. How do I achieve this using a shell script?

For example:

pwd
folder

Files in it are:

log.bkp
log
extract.bkp
extract

I need to compare log.bkp with all the other files and, if a duplicate is found (by its content), delete it. Similarly, the file 'log' has to be checked against all the files that follow it, and so on.

So far I have written this, but it is not giving the desired result.

#!/usr/bin/env ksh
count=`ls -ltrh /folder | grep '^-'|wc -l`
for i in `/folder/*`
do
   for (( j=i+1; j<=count; j++ ))
   do
      echo "Current two files are $i and $j"
      sdiff -s $i  $j
      if [ `echo $?` -eq  0 ]
      then
         echo "Contents of $i and $j are same"
       fi
    done
 done

Best Answer

If you're happy to use a command-line tool rather than write a shell script, the fdupes program does this and is available on most distros.

There's also the GUI-based fslint tool, which has the same functionality.
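
If neither tool is installed, the same result can be approximated in plain shell. What follows is only a sketch under stated assumptions, not how fdupes works internally: it looks at regular files directly inside one directory (no recursion, hidden files skipped by the glob), compares contents byte for byte with cmp, keeps the first file of each group, and only reports what it would delete unless asked otherwise. The function name dedupe_dir and its --delete switch are made up for this illustration.

```shell
#!/bin/sh
# dedupe_dir DIR [--delete]
# Hypothetical helper. The body runs in a subshell, so its variables and
# positional parameters do not leak into the caller.
dedupe_dir() (
    dir=$1
    mode=${2:-}
    set --                                   # "$@" will hold the files we keep
    for f in "$dir"/*; do
        # Regular files only; skip symlinks, directories, and an unmatched glob.
        [ -f "$f" ] && [ ! -L "$f" ] || continue
        dup=
        for kept in "$@"; do
            if cmp -s -- "$kept" "$f"; then  # byte-for-byte identical
                dup=$kept
                break
            fi
        done
        if [ -z "$dup" ]; then
            set -- "$@" "$f"                 # first copy of this content: keep it
        elif [ "$mode" = "--delete" ]; then
            rm -- "$f" && echo "removed $f (same as $dup)"
        else
            echo "would remove $f (same as $dup)"
        fi
    done
)
```

Run `dedupe_dir /folder` first to review the list, then `dedupe_dir /folder --delete` to actually remove the files. Because every file is compared against each kept file, the cost grows quadratically with the number of distinct files; for large directories a checksum-based approach scales better.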

Related Solutions

Find files that have a confirmed duplicate in same directory recursively

It is slightly long, but it is a single command line. It looks at the contents of the files and compares them using a cryptographic hash (md5sum).

find . -type f -exec md5sum {} + | sort | sed 's/  */|/1' | awk -F\| 'BEGIN{first=1}{if($1==lastid){if(first){first=0;print lastid, lastfile}print$1, $2} else first=1; lastid=$1;lastfile=$2}'

As I said, this is a little long...

The find runs md5sum against all files in the current directory tree. The output is then sorted by the md5 hash, so identical files end up on adjacent lines. Since whitespace could be in the filenames, the sed changes the first field separator (two spaces) to a vertical pipe (very unlikely to be in a filename).

The last awk command tracks three variables: lastid = the md5 hash from the previous entry, lastfile = the filename from the previous entry, and first = a flag that is set until the first file of the current group of duplicates has been printed.

The output includes the hash so you can see which files are duplicates of each other.

This does not indicate if files are hard links (same inode, different name); it will just compare the contents.
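
For readability, the same steps can be spread over several lines. This is a sketch of the approach described above rather than a drop-in replacement: the function name list_dup_contents is invented here, it assumes GNU md5sum is installed, and it assumes filenames contain no newlines or backslashes (md5sum escapes those, which would change the hash column). It prints whole md5sum lines, so no separator rewriting is needed.

```shell
# list_dup_contents [DIR]
# Hypothetical wrapper: prints "hash  path" for every file whose checksum is
# shared with at least one other file under DIR (default: current directory).
list_dup_contents() {
    find "${1:-.}" -type f -exec md5sum {} + |
        sort |
        awk '
            {
                hash = $1
                if (hash == prev_hash) {
                    # Print the first member of a group once, then each later member.
                    if (!printed) { print prev_line; printed = 1 }
                    print $0
                } else {
                    printed = 0
                }
                prev_hash = hash
                prev_line = $0
            }'
}
```

Calling `list_dup_contents .` from the top of the tree gives the same kind of listing as the one-liner, grouped by hash.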

Update: corrected to match on just the basename of each file.

find . -type f -print | sed 's!.*/\(.*\)\.[^.]*$!\1|&!' | awk -F\| '{i=indices[$1]++;found[$1,i]=$2}END{for(bname in indices){if(indices[bname]>1){for(i=0;i<indices[bname];i++){print found[bname,i]}}}}'

Here, the find just lists the filenames. The sed takes the basename component of each pathname (minus its final extension) and creates a two-field table with the basename and the full pathname. The awk then builds a table ("found") of the pathnames seen, indexed by the basename and the item number; the "indices" array keeps track of how many times each basename has been seen. The "END" clause then prints every pathname whose basename occurs more than once.
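
To see the intermediate two-field table, the sed stage can be run on its own against a few sample paths. The sketch below wraps the same sed expression in a function; the names basename_table and dup_basenames and the sample paths are made up for illustration, and the awk part is a simplified variant of the one above, not the original.

```shell
# basename_table: reads pathnames on stdin and writes "name|pathname" lines,
# where name is the last path component with its final extension removed.
basename_table() {
    sed 's!.*/\(.*\)\.[^.]*$!\1|&!'
}

# dup_basenames: prints the pathnames whose name column occurs more than once.
dup_basenames() {
    basename_table | awk -F'|' '
        { count[$1]++; paths[$1] = paths[$1] $2 "\n" }
        END { for (name in count) if (count[name] > 1) printf "%s", paths[name] }'
}
```

Feeding ./a/report.txt, ./b/report.csv and ./b/notes.md to basename_table yields report|./a/report.txt, report|./b/report.csv and notes|./b/notes.md, one per line; dup_basenames then prints only the two report paths.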

shell-script – Remove Duplicate Lines from Multiple Files in a Folder

I modified the script to loop over the files four at a time. I tested it on about 20 files and it appears to work. The script stores the filenames in an array and then pairs each file with the three files that follow it:

    temp=$(mktemp)

    # The glob already expands in sorted order, so no external sort is needed.
    files=(*.json)
    length=${#files[@]}

    for ((i = 0; i < length; i++))
    do
      # Pair each file with the next three files in the list.
      for ((j = 1; j <= 3; j++))
      do
        ((i + j >= length)) && break
        echo "${files[i]} ${files[i+j]}"
        #grep -w -Ff "${files[i]}" -v "${files[i+j]}" > "${temp}"
        #mv "${temp}" "${files[i+j]}"
      done
    done

For now the script only echoes each pair of filenames. If the pairs look right, uncomment the grep and mv lines to apply the changes.
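
To make the effect of the commented-out grep and mv lines concrete, here is a small sketch of that single step. The function name strip_seen_lines is hypothetical. It keeps the same grep options as the script, so matching is by fixed string and whole word (-F, -w): a line in the later file that merely contains an earlier line as a whole word is dropped too, whereas -x would restrict this to whole-line matches.

```shell
# strip_seen_lines EARLIER LATER
# Hypothetical helper: rewrites LATER so that it keeps only the lines that do
# not match any line of EARLIER. Runs in a subshell to keep variables local.
strip_seen_lines() (
    tmp=$(mktemp) || exit 1
    grep -w -v -Ff "$1" "$2" > "$tmp"
    status=$?
    # grep exits 1 when no line survives the filter; only 2 or more is an error.
    if [ "$status" -gt 1 ]; then
        rm -f -- "$tmp"
        exit "$status"
    fi
    mv -- "$tmp" "$2"
)
```

With a.json containing the lines alpha and beta, and b.json containing beta and gamma, `strip_seen_lines a.json b.json` leaves b.json with only gamma and does not touch a.json.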
