shell-script – Remove Duplicate Lines from Multiple Files in a Folder

I had a question about removing duplicate lines in multiple files and was provided with a useful script here: Remove duplicate lines from multiple JSON files while preserving file structure.

The problem is that my folder has 10000 files and each is 1.5 MB in size. The script has been running for days and is nowhere near done. My folder looks like this:

file.1424-417982.json
file.1424-417995.json
file.1424-418013.json
file.1424-418015.json
file.1424-418019.json
file.1424-418027.json    
(9994 more files)

I have determined that duplicate lines only occur within a limited range of consecutive files. There may be duplicate lines among the first four files above, but those lines won't appear in any other file in the folder. Likewise, there may be duplicates among files 2-5, but not in the other files.
How do I modify the shell/bash script to look for duplicates only within a range of 4 files, and to do this sequentially almost 10000 times, shifting the range from 1-4, 2-5, 3-6… up to 9997-10000?

Here is the code I was provided for looking for duplicates. I tested it on a test folder with only 6 files and it was fast enough.

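#!/bin/bash
temp=$(mktemp)
# The glob already expands in sorted order, so piping it through sort is unnecessary
for file_to_dedupe in *.json
do
   for file_to_strip in *.json
   do
      [ "$file_to_dedupe" == "$file_to_strip" ] && continue
      # Keep only the lines of file_to_strip that match no line of file_to_dedupe
      grep -w -Ff "${file_to_dedupe}" -v "${file_to_strip}" > "${temp}"
      mv "${temp}" "${file_to_strip}"
   done
done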
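
For a rough sense of scale, the script above runs one grep per ordered pair of files, while a 4-file sliding window only needs each file compared against the next three. The sketch below just does that arithmetic; `pair_counts` and its parameters are hypothetical names used for illustration, and the windowed formula assumes the folder holds at least as many files as the window size.

```shell
#!/bin/bash
# Hypothetical helper: print the number of grep runs for the full pairwise
# loop and for a sliding window of w files, given n files (assumes n >= w).
pair_counts() {
  local n=$1 w=${2:-4}
  # Full loop: every file against every other file
  local full=$(( n * (n - 1) ))
  # Window: each file against the next w-1 files, minus the pairs that
  # would run past the end of the list
  local windowed=$(( (w - 1) * n - (w - 1) * w / 2 ))
  echo "$full $windowed"
}

pair_counts 10000 4   # prints: 99990000 29994
```

With 10000 files that is roughly 100 million grep runs against about 30 thousand, which is consistent with the full loop being fast on 6 files and impractical on the whole folder.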

Best Answer

I modified the script to loop over the files in windows of 4. I tested it on about 20 files and it appears to work. The script stores the filenames in an array and then compares each file with the three files that follow it:

    temp=$(mktemp)

    # The glob expands in sorted order, so no separate sort is needed
    files=(*.json)
    length=${#files[@]}

    for ((i=0; i<length; i++))
    do
      # Compare files[i] with the next three files only
      for ((j=1; j<=3; j++))
      do
        # Stop when the window runs past the last file
        [ "$((i+j))" -ge "$length" ] && break
        echo "${files[i]}" "${files[i+j]}"
        #grep -w -Ff "${files[i]}" -v "${files[i+j]}" > "${temp}"
        #mv "${temp}" "${files[i+j]}"
      done
    done

I only echo the file pairs here. If the output looks right to you, uncomment the grep and mv lines to apply the changes.
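
Once the dry run looks right, the same loop could be wrapped in a function with the grep step enabled. The following is a minimal sketch under stated assumptions, not a tested replacement for the script above: `dedupe_window`, `dir`, and `window` are hypothetical names, the grep flags are the ones from the original script, and two details are additions of this sketch (writing the result back with `cat` so the target file keeps its permissions, and leaving the target untouched if grep reports an error).

```shell
#!/bin/bash
# Hypothetical helper: dedupe_window DIR [WINDOW]
# For each file, remove its lines from the next WINDOW-1 files in sorted order.
dedupe_window() {
  local dir=$1 window=${2:-4}
  local temp files length i j status
  files=("$dir"/*.json)              # glob expansion is already sorted
  [ -e "${files[0]}" ] || return 0   # no .json files: nothing to do
  length=${#files[@]}
  temp=$(mktemp) || return 1
  for ((i = 0; i < length; i++)); do
    for ((j = 1; j < window && i + j < length; j++)); do
      status=0
      # Same flags as the original script: fixed strings, whole words, inverted
      grep -v -w -F -f "${files[i]}" "${files[i+j]}" > "$temp" || status=$?
      # Exit status 1 only means grep printed nothing; 2 or more is an error
      if ((status > 1)); then
        rm -f "$temp"
        return "$status"
      fi
      # Addition in this sketch: overwrite in place to keep file permissions
      cat "$temp" > "${files[i+j]}"
    done
  done
  rm -f "$temp"
}
```

It would be called as `dedupe_window /path/to/folder 4`, ideally on a copy of the folder first, since the files are rewritten in place.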
