Using sort with alphanumeric filenames

I'm sorting the results of a find command which finds all of the FLAC files in the current directory:

find . -maxdepth 1 -type f -iname "*.flac" | sort

What I'm expecting is a list like this:

./Track 1.flac
./Track 2.flac
./Track 3.flac
...
./Track 9.flac
./Track 10.flac
./Track 11.flac

What I'm getting is a list like this:

./Track 10.flac
./Track 11.flac
./Track 1.flac
./Track 2.flac
./Track 3.flac
...
./Track 9.flac

Is there an option to sort which will put them in alphanumeric ascending order so that numbers are evaluated properly?

Best Answer

Try passing the -n and -k2 command-line options to sort:

find . -maxdepth 1 -type f -iname "*.flac" | sort -n -k2

When I put your unsorted filenames into file 'data.txt' and run this command:

sort -k2 -n data.txt

I get this as output:

./Track 1.flac
./Track 2.flac
./Track 3.flac
./Track 9.flac
./Track 10.flac
./Track 11.flac

Explanation of options:

-n (numeric sort) compares keys according to their numerical value
-k2 means sort on the 2nd field (and to the end of the line);
    you could restrict it to just the second field with -k2,2

You didn't ask about this, and I didn't use it above, but it may come in handy some day.

-r reverse sort order
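To see these options working together, here is a minimal sketch. The sample names are the ones from the question, generated with printf instead of find; the variable names (tracks, sorted, reversed) are only for illustration.

```shell
# Sample data: the track names from the question, in an unsorted order.
tracks=$(printf './Track %s.flac\n' 10 11 1 2 3 9)

# -k2,2 restricts the key to the second whitespace-separated field
# ("10.flac", "1.flac", ...); the n modifier compares its leading digits
# numerically.
sorted=$(printf '%s\n' "$tracks" | sort -k2,2n)
printf '%s\n' "$sorted"

# Adding r to the same key reverses the order, highest track number first.
reversed=$(printf '%s\n' "$tracks" | sort -k2,2nr)
printf '%s\n' "$reversed"
```

This only works because the number is the first thing in the second field. GNU sort also has -V (version sort), an extension outside POSIX, which orders embedded numbers naturally without naming a field.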

man page for sort

See my related post on SO about sorting according to different fields, "Sort by third column leaving first and second column intact (in linux)", which explains more about the sort command.

Related Solutions

Shell – Find, count and sort all audio files. ALAC (M4A) files

You should decompose your goal into several steps that are each easier to solve. This has two advantages:

It will be easier to solve,
The resulting code will be clearer and more reusable.

The script below basically follows these steps:

Generate raw statistics files. An easy way is to append the file size and the file name to a temporary file named after the original file's extension. So, if you have the file /path/to/foo.mp3, which is 3000000 bytes in size, it will append the line 3000000 /path/to/foo.mp3 to a temporary file named mp3.
Handle specific cases. Here it will process the temporary file m4a and create the two other files m4a_aac and m4a_alac based on the test you gave in the question.
Generate output. With all the needed information now available, it just has to:
- Count the number of lines in each temporary file to determine the number of files of this type,
- Sum up the sizes to get the total size of the files of this type.

Here is the script:

#!/bin/sh

# This script takes the searched directory as first parameter.
# For instance: ./this-script.sh ~/Music

: ${1:?"You must pass the search directory as first parameter."}
searchdir="$1"

# Create a temporary directory
statsdir=""
trap 'rm -rf "$statsdir"' EXIT
statsdir=$(mktemp -d "/tmp/tmp.XXXXXXXXXX") || exit 1

# Generate one listing file per extension
awkscript='/\.[[:alnum:]]+$/ {print $0 >statsdir"/"$(NF)}'
# For Linux: stat -c "%s %n"
# For Mac: stat -f "%z %N"
find "$searchdir" -type f -exec stat -f "%z %N" {} + | \
    awk -F '.' -v statsdir="$statsdir" "$awkscript"

# Distinguish between m4a/AAC and m4a/ALAC
if [ -f "$statsdir/m4a" ]; then
    input="$statsdir/m4a"
    while IFS= read -r line; do
        filename=${line#* }
        if avprobe "$filename" 2>&1 | grep -q 'Audio: alac'; then
            echo "$line" >> "$statsdir/m4a_alac"
        else
            echo "$line" >> "$statsdir/m4a_aac"
        fi
    done < "$input"
    rm "$statsdir/m4a"
fi

# Generate and display result
{
    printf "Type Count Size\n"
    for extension in $(ls "$statsdir"); do
        # Read from stdin so wc prints only the count; BSD wc pads it with spaces
        count=$(wc -l < "$statsdir/$extension" | tr -d ' ')
        totalsize=$(awk '{s+=$1} END {print s}' "$statsdir/$extension")
        printf "%s %d %d\n" "$extension" "$count" "$totalsize"
    done
} | column -t
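The script above hardcodes the Mac form of stat and leaves the Linux form in a comment. As a rough sketch, the choice could be made at run time. This assumes that uname -s prints "Linux" on systems with GNU stat and that everything else has BSD stat, which will not hold everywhere; the names statflag, statfmt, demo and listing are made up for this example.

```shell
# Assumption: "Linux" means GNU stat (-c); any other system is treated as
# BSD/macOS stat (-f). Adjust the cases for other platforms.
case "$(uname -s)" in
    Linux) statflag=-c; statfmt='%s %n' ;;
    *)     statflag=-f; statfmt='%z %N' ;;
esac

# Hypothetical demo directory holding a single 5-byte file.
demo=$(mktemp -d) || exit 1
printf 'hello' > "$demo/foo.mp3"

# Same find invocation as in the script, with the stat options in variables.
listing=$(find "$demo" -type f -exec stat "$statflag" "$statfmt" {} +)
printf '%s\n' "$listing"

rm -rf "$demo"
```

Each output line has the form "size path", which is what the awk step in the script expects.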

Bash – the sort order when using conditional operators

Yes, glob expansion is always sorted.
In bash (from LESS=+/'^ *Pathname Expansion' man bash)

Pathname Expansion ... the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.

This is also specified by POSIX glob:

... The pathnames are in sort order as defined by the current setting of the LC_COLLATE category.

Note1: unless the GLOB_NOSORT flag is set, in which case the order is unspecified.

Note2: The sort order is alphabetic (not numeric): 10 sorts before 2.
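A small sketch of Note2, using hypothetical file names in a throwaway directory: the glob lists file10.txt first, and piping the names through sort with a numeric key gives the numeric order. The key position 1.5 assumes that every name starts with the four-character prefix "file".

```shell
# Hypothetical demo directory with three empty files.
demo=$(mktemp -d) || exit 1
touch "$demo/file2.txt" "$demo/file3.txt" "$demo/file10.txt"

# Glob expansion is sorted by collation order, so "file10" precedes "file2".
globbed=$(cd "$demo" && set -- file*.txt && printf '%s' "$*")
printf 'glob order:    %s\n' "$globbed"

# -k1.5n starts the key at the 5th character of field 1 (just after "file")
# and compares it numerically.
numeric=$(cd "$demo" && printf '%s\n' file*.txt | sort -k1.5n | paste -sd ' ' -)
printf 'numeric order: %s\n' "$numeric"

rm -rf "$demo"
```

Piping names through sort line by line assumes that they contain no newlines.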

Answers:

Do I need to sort the file content (either with sort or additional bash code) ...

Globbing has no relation to the file contents; it only works with file names.
If you need to sort the "file contents", then, yes, you do need to call sort or use quite a bit more bash code.

... or is the glob expansion always sorted - in every environment?

Unless it is disabled with GLOB_NOSORT, the result of globbing is sorted in the order defined by the collation order (variable LC_COLLATE) in the environment.

To have the same sort order you must have the same collation in effect: both the same LC_COLLATE setting and a locale definition that contains the same collation details.
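One way to get the same collation everywhere is to pin it to the C locale, which every POSIX system provides and which compares by byte value. A minimal sketch with made-up sample names follows; LC_ALL is used here because it overrides LC_COLLATE.

```shell
# Made-up sample names mixing upper and lower case.
names=$(printf '%s\n' b A a B)

# In the C locale all uppercase ASCII letters sort before all lowercase ones.
c_order=$(printf '%s\n' "$names" | LC_ALL=C sort | paste -sd ' ' -)
printf 'C locale:       %s\n' "$c_order"

# With whatever locale is currently active the result may differ.
locale_order=$(printf '%s\n' "$names" | sort | paste -sd ' ' -)
printf 'current locale: %s\n' "$locale_order"
```

The same assignment in front of a shell invocation affects glob expansion too, since both use the collation order of the environment.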

Do the conditional operators use the same "sorting" as the expansion (or after sort)?

Yes. Both are affected in the same way by LC_COLLATE.

Would expansion or sort return file10.txt after file2.txt (in what cases?) but using conditional operators file10.txt would be before file2.txt ? What sort option would I use to fix this?

A result of 10 before 2 is "dictionary order" which is the same as what is called "alphabetic order" in the bash manual description. So, if you use bash (or any POSIX shell) to sort, that's the order you will get (in all cases). That's not wrong, so it is not fixable (for text).

However, if you choose to use sort (an external tool, outside the shell) you may ask for numeric sort (the -n option), which will place 2 before 10. Or you may extract the numbers from the text and use them to make an integer comparison (the -lt -gt integer operators) in the shell.
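The second suggestion, extracting the numbers and comparing them as integers, could look roughly like this. The names file10.txt and file2.txt come from the question; the sketch assumes a fixed "file" prefix and ".txt" suffix, and the variable names are illustrative.

```shell
a=file10.txt
b=file2.txt

# Strip the assumed prefix and suffix, leaving only the digits.
num_a=${a#file}; num_a=${num_a%.txt}
num_b=${b#file}; num_b=${num_b%.txt}

# -lt compares integers, so 2 is less than 10 here, whereas a string
# comparison would place "file10.txt" before "file2.txt".
if [ "$num_b" -lt "$num_a" ]; then
    first=$b
else
    first=$a
fi
printf 'numerically first: %s\n' "$first"
```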

Are there any caveats if some of my filenames are in Unicode?

Mostly: Collation order is not fixed.

It changes with time and Unicode version.

What may happen is that you get some surprising results in some language that you are not familiar with. For example:

"aa" would match "å" in a Danish locale.

In short: » Be prepared to be surprised «.

Are there any issues using specific versions of bash?

Well, you must use a bash version above 2.0

Feature: respect LC_COLLATE. Bash version: 2.0.

Does LC_COLLATE affect any of the above?

The variable LC_COLLATE affects all of the above.
