Bash – Delete all but the most recent n files for each group of files that share the same prefix in a directory

Tags: bash, files, shell-script, timestamps

My question is a bit different from some older questions simply asking for "deleting all but the most recent n files in a directory".

I have a directory that contains different 'groups' of files where each group of files share some arbitrary prefix and each group has at least one file. I do not know these prefixes in advance and I do not know how many groups there are.

EDIT: Actually, I do know something about the file names: they all follow the pattern prefix-some_digits-some_digits.tar.bz2. The only thing that matters here is the prefix part, and we can assume that a prefix never contains a digit or a dash.

I want to do the following in a bash script:

  1. Go through the given directory, identify all existing 'groups', and for each group of files, delete all but the most recent n files of the group only.

  2. If there are fewer than n files in a group, do nothing for that group, i.e. do not delete any file of that group.

What is a robust and safe way of doing the above in bash? Could you please explain the commands step-by-step?

Best Answer

The script:

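#!/bin/bash

# Number of most recent files to keep per prefix (first argument)
if [ -z "$1" ]; then
  echo need a number of keep files.
  exit 1
else
  NUMKEEP=$1
fi

# Get prefixes: match prefix-digits-digits.tar.bz2 and keep the part before the first dash
PREFIXES=$(ls | grep -E '^[^0-9-]+-[0-9]+-[0-9]+\.tar\.bz2$' | awk -F'-' '{print $1}' | sort -u)

# Note: the unquoted expansions below rely on file names without whitespace
for PREFIX in ${PREFIXES}; do

  # All files of this group, newest first.
  # The dash after the prefix keeps "foo" from also matching "foobar-..."
  ALL_FILES=$(ls -t -- "${PREFIX}"-*.tar.bz2)

  if [ "$(echo "${ALL_FILES}" | wc -l)" -lt "$NUMKEEP" ]; then
    echo Not enough files to be kept. Quit.
    continue
  fi

  # The first NUMKEEP entries are the most recent ones
  KEEP=$(echo "${ALL_FILES}" | head -n "${NUMKEEP}")

  for file in $ALL_FILES ; do
    if [[ "$KEEP" =~ "$file" ]]; then
      echo "keeping $file"
    else
      # Dry run: only print the file that would be removed
      echo "RM $file"
    fi
  done
done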

Explanation:

  • Calculate the prefixes:
    • List all files that follow the something-something-something.tar.bz2 pattern, cut off everything after the first dash, and remove duplicates.
    • The result is a normalized list of PREFIXES.
  • Iterate through all PREFIXES:
    • Calculate ALL_FILES for the PREFIX, sorted newest first (ls -t).
    • Check whether the number of ALL_FILES is less than the number of files to keep. If so, skip this group; there is nothing to remove.
    • Calculate the KEEP files, which are the most recent NUMKEEP files.
    • Iterate through ALL_FILES and check whether each file is in the KEEP list. If it is not, it is a candidate for removal. As written, the script only prints RM for such files (a dry run); replace the echo RM line with rm -- "$file" to actually delete them.
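
The steps above can also be sketched as a single function that uses bash arrays instead of word splitting, so that prefixes containing spaces survive. This is only a sketch under stated assumptions, not the script from the answer: the function name prune_groups and the DRY_RUN variable are hypothetical, bash 4 or later is assumed (for mapfile and associative arrays), file names are assumed to contain no newlines, and every file starting with a known prefix is assumed to follow the prefix-digits-digits.tar.bz2 pattern.

```shell
#!/bin/bash
# Hypothetical helper: keep the $2 newest files of every prefix group in directory $1.
# Set DRY_RUN to a non-empty value to print what would be removed instead of deleting.
prune_groups() {
  local dir=$1 keep=$2
  (
    # Work inside a subshell so cd and shopt do not leak to the caller
    cd -- "$dir" || exit 1
    shopt -s nullglob
    declare -A seen=()
    re='^([^0-9-]+)-[0-9]+-[0-9]+\.tar\.bz2$'

    for name in *-*-*.tar.bz2; do
      # Only accept prefix-digits-digits.tar.bz2 and capture the prefix
      [[ $name =~ $re ]] || continue
      prefix=${BASH_REMATCH[1]}

      # Handle every prefix once
      [[ -n ${seen["$prefix"]:-} ]] && continue
      seen["$prefix"]=1

      # Files of this group, newest first, one name per array element
      mapfile -t files < <(ls -t -- "$prefix"-*.tar.bz2)

      # Elements from index $keep onward are older than the newest $keep files;
      # the slice is empty when the group has $keep files or fewer
      for old in "${files[@]:keep}"; do
        if [[ -n ${DRY_RUN:-} ]]; then
          echo "would remove $old"
        else
          rm -- "$old"
        fi
      done
    done
  )
}
```

Calling DRY_RUN=1 prune_groups /some/dir 2 first shows what a real run would delete. Because the array slice is simply empty for small groups, no separate "not enough files" check is needed.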

Example result when running it:

$ ./remove-old.sh 2
keeping bar-01-01.tar.bz2
keeping bar-01-02.tar.bz2
RM bar-01-03.tar.bz2
RM bar-01-04.tar.bz2
RM bar-01-05.tar.bz2
RM bar-01-06.tar.bz2
keeping foo-01-06.tar.bz2
keeping foo-01-05.tar.bz2
RM foo-01-04.tar.bz2
RM foo-01-03.tar.bz2
RM foo-01-02.tar.bz2

$ ./remove-old.sh 8
Not enough files to be kept. Quit.
Not enough files to be kept. Quit.
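
To try the script without touching real archives, a scratch directory with known modification times helps. The following is a minimal sketch; the function name make_sample_files and the chosen timestamps are hypothetical, and the resulting files will not reproduce the exact listing above, since that depends on the modification times of the original test files.

```shell
#!/bin/bash
# Hypothetical helper: fill directory $1 with empty sample archives for two groups.
make_sample_files() {
  local dir=$1 prefix i
  for prefix in foo bar; do
    for i in 1 2 3 4 5 6; do
      # touch -t takes [[CC]YY]MMDDhhmm; each file is one minute newer than the previous one
      touch -t "20200101000$i" "$dir/$prefix-01-0$i.tar.bz2"
    done
  done
}
```

After make_sample_files "$(mktemp -d)" (with the directory saved in a variable), change into that directory and run the script there; the files numbered 06 are the newest in each group.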