Shell – Sort files by highest number in filename

files shell-script

I've got a bunch of files all named like this:

name_file-1.txt
name_file-2.txt
name_file-3.txt
some_other_file-1.txt
some_other_file-2.txt

There are thousands of different filenames, some with just one -1.txt at the end, some with -1.txt, -2.txt … -60.txt

I need to copy only the highest-numbered file of each group, so here name_file-3.txt and some_other_file-2.txt. How do I do that on a Linux command line?

Best Answer

With zsh:

typeset -A greatest
for f (*-*(n)) greatest[${f%-*}]=$f
cp -- $greatest /destination

*-*(n): non-hidden files whose name contains a - (*-*), sorted numerically ((n) glob qualifier).
${f%-*}: part of the filename up to the right-most - (or to the end if there's no -).
$greatest: expands to the non-empty values of the associative array. Because the glob is sorted numerically, each assignment overwrites the previous entry for the same root, so for files that share a root only the one with the greatest number is left to be expanded.
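
Where zsh is not available, the same idea can be sketched in bash (version 4 or later, for associative arrays). This is only an illustration under assumptions: the helper name pick_highest is made up here, and it assumes every file of interest ends in -<number>.txt as in the question. Bash has no numeric glob sort, so the sketch compares the numbers explicitly instead of relying on the last assignment winning.

```shell
#!/bin/bash
# Sketch: for each root, remember the file with the highest numeric suffix.
# pick_highest is a hypothetical helper; it prints one path per line,
# in no particular order.
pick_highest() {
    local dir=$1 f base root num
    local -A best_num best_file
    for f in "$dir"/*-*.txt; do
        [[ -f $f ]] || continue            # also skips the literal pattern if nothing matched
        base=${f##*/}                      # strip the directory part
        root=${base%-*}                    # text before the right-most "-"
        num=${base##*-}                    # text after the right-most "-", e.g. "3.txt"
        num=${num%.txt}                    # drop the extension, leaving "3"
        [[ $num =~ ^[0-9]+$ ]] || continue # ignore names without a numeric suffix
        # 10# forces base 10 so that a suffix like "08" is not read as octal
        if [[ -z ${best_num[$root]+set} ]] || (( 10#$num > 10#${best_num[$root]} )); then
            best_num[$root]=$num
            best_file[$root]=$f
        fi
    done
    if (( ${#best_file[@]} > 0 )); then
        printf '%s\n' "${best_file[@]}"
    fi
}
```

It could then be used as pick_highest . | while IFS= read -r f; do cp -- "$f" /destination; done, with the caveat that this line-based output does not cope with filenames containing newlines.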

Related Solutions

How to remove multiple files with a common prefix and suffix

rm sequence_1*.hmf

removes files beginning with sequence_1 and ending with .hmf.

Globbing is the process in which your shell takes a pattern and expands it into a list of filenames matching that pattern. Do not confuse glob patterns with regular expressions, which use a different syntax. If you spend most of your time in bash, the Wooledge Wiki has a good page on globbing (pathname expansion). If you want maximum portability, you'll want to read the POSIX spec on pattern matching as well, or instead.
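
To make the difference concrete, here is a small bash sketch that tests one filename against the glob from above, against the same text misread as a regular expression, and against a regular expression that actually means the same as the glob. The filename and the variable names are invented for the illustration.

```shell
#!/bin/bash
name=sequence_10.hmf

# Glob: "*" on its own matches any run of characters.
if [[ $name == sequence_1*.hmf ]]; then glob_match=yes; else glob_match=no; fi

# The same text read as a regular expression means something else:
# "1*" is "zero or more 1s" and "." is "any single character".
naive_re='^sequence_1*.hmf$'
if [[ $name =~ $naive_re ]]; then naive_match=yes; else naive_match=no; fi

# A regular expression equivalent to the glob needs ".*" and an escaped dot.
proper_re='^sequence_1.*\.hmf$'
if [[ $name =~ $proper_re ]]; then proper_match=yes; else proper_match=no; fi

printf 'glob=%s naive_regex=%s proper_regex=%s\n' \
    "$glob_match" "$naive_match" "$proper_match"
# prints: glob=yes naive_regex=no proper_regex=yes
```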

In the unlikely case you run into an "Argument list too long" error, you can take a look at BashFAQ 95, which addresses this. The simplest workaround is to break up the glob pattern into multiple smaller chunks, until the error goes away. In your case, you could probably get away with splitting the match by prefix digits 0 through 9, as follows:

for c in {0..9}; do rm sequence_1_"$c"*.hmf; done
rm sequence_1*.hmf  # catch-all case
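
Another way to sidestep the limit, sketched here on the assumption that the files sit directly in one directory, is to let the shell loop over the matches and call rm once per file. The glob is expanded inside the shell itself, so no oversized argument list is ever handed to an external command. The function name remove_matching is hypothetical.

```shell
#!/bin/bash
# Sketch: remove matching files one at a time instead of in a single rm call.
# remove_matching is a hypothetical helper; its argument is the directory to clean.
remove_matching() {
    local dir=$1 f
    for f in "$dir"/sequence_1*.hmf; do
        [[ -e $f ]] || continue   # nothing matched: the pattern is left as literal text
        rm -- "$f"
    done
}
```

Calling remove_matching . cleans the current directory. This starts one rm process per file, so it is slower than a single rm, but it does not depend on how many files match.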

Shell Script – Sort Files into Multiple Directories Based on Filename

As already noted, the short answer is "yes".

The long answer is: You can do it with a bash script that uses awk to extract the filename elements you want to base your directory structure on. It could look something like this (where more emphasis is placed on readability than "one-liner" compactness).

#!/bin/bash

# Note: the three-argument form of match() used below is a GNU awk (gawk) extension.
for FILE in p-*
do
    # Skip anything that is not a regular file
    if [[ ! -f $FILE ]]; then continue; fi

    LVL1="$(awk '{match($1,"^p-([[:digit:]]+)_[[:print:]]*",fields); print fields[1]}' <<< "$FILE")"
    LVL2="$(awk '{match($1,"^p-([[:digit:]]+)_n-([[:digit:]]+)_[[:print:]]*",fields); print fields[2]}' <<< "$FILE")"

    echo "move $FILE to p-$LVL1/n-$LVL2"
    if [[ ! -d "p-$LVL1" ]]
    then
        mkdir "p-$LVL1"
    fi

    if [[ ! -d "p-$LVL1/n-$LVL2" ]]
    then
        mkdir "p-$LVL1/n-$LVL2"
    fi

    # Quote the variables so that filenames containing whitespace survive intact
    mv -- "$FILE" "p-$LVL1/n-$LVL2"
done

To explain:

We perform a loop over all files starting with "p-" in the current directory.
The first instruction in the loop skips anything that is not a regular file. It also covers the case where no file matches at all: bash then leaves the pattern p-* unexpanded, and the loop would otherwise run once with that literal string. (The reason for looping over the glob directly is that on this forum, you will always be told not to parse the output of ls, so something like FILES=$(ls p-*); for FILE in $FILES; do ... would be considered a no-go.)
Then, we use awk to extract the numerals between p- and _n needed to generate the first level of your directory structure (as you suspected, with regular expressions), and likewise the numerals between n- and _a for the second level. The idea is to use the match function, which not only looks for the place where the specified regular expression occurs in your input but, when given an array as third argument (a GNU awk extension), also stores the text matched by each group enclosed in round brackets ( ... ) in that array, here "fields".
Third, we check if the directories for the first and second level of your intended directory structure already exist. If not, we create them.
Last, we move the file to the target directory.

For more information, have a look at the Advanced Bash-Scripting Guide and the GNU Awk User's Guide.

Once you are more comfortable with scripting and regular expressions, you can make this much more compact; in the above script, for example, the generation of the directory/subdirectory path could easily be contracted to just one awk call.

For one, since the directory names are actually p-<number> and n-<number>, the same as in your filename, we could have let awk do the work to extract these characters for us, too, by writing match($1,"(^p-[[:digit:]]+)_(n-[[:digit:]]+)_[[:print:]]*",fields)
We can further offload work to awk by having it generate the directory-subdirectory path at the same time with a suitable argument of print:

awk '{match($1,"(^p-[[:digit:]]+)_(n-[[:digit:]]+)_[[:print:]]*",fields); print fields[1] "/" fields[2]}'

would readily yield (e.g.) p-12345/n-384 for file p-12345_n-384_a-583.pdf. If we combine that with the usage of mkdir -p as indicated by @wurtel, the script could look like

for FILE in p-*
do
    if [[ ! -f $FILE ]]; then continue; fi

    TARGET="$(awk '{match($1,"(^p-[[:digit:]]+)_(n-[[:digit:]]+)_[[:print:]]*",fields); print fields[1] "/" fields[2]}' <<< "$FILE")"
    echo "move $FILE to $TARGET"

    mkdir -p "$TARGET"
    # Quote both variables so that names containing whitespace are not split
    mv -- "$FILE" "$TARGET"
done
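
If GNU awk is not installed, the extraction can also be sketched with bash's own regular-expression operator, which stores the bracketed groups in the BASH_REMATCH array. This is a variant under assumptions, not the method used above: the function name sort_into_dirs is made up, and files whose names do not fit the p-<number>_n-<number>_ scheme are simply left where they are.

```shell
#!/bin/bash
# Sketch: sort p-<number>_n-<number>_... files into p-<number>/n-<number>/
# using only bash. sort_into_dirs is a hypothetical helper; its argument
# is the directory that holds the files.
sort_into_dirs() {
    local dir=$1 file name target
    local re='^(p-[0-9]+)_(n-[0-9]+)_'
    for file in "$dir"/p-*; do
        [[ -f $file ]] || continue        # skips directories and an unmatched pattern
        name=${file##*/}                  # filename without the directory part
        [[ $name =~ $re ]] || continue    # leave names that do not fit the scheme
        target=$dir/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}
        mkdir -p -- "$target"
        mv -- "$file" "$target/"
    done
}
```

Run as sort_into_dirs . in the directory that contains the files. Keeping the match inside the shell avoids starting one awk process per file.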
