You should decompose your goal into several steps that are each easier to solve. This has two advantages:
- Each step is easier to solve,
- The resulting code is clearer and more reusable.
The script below follows these steps:
- Generate raw statistics files. An easy way is to append the file size and the file name to a temporary file named after the original file's extension. So, if you have the file /path/to/foo.mp3, which is 3000000 bytes large, the script appends the line "3000000 /path/to/foo.mp3" to the end of a temporary file named mp3.
- Handle specific cases. Here the script processes the temporary file m4a and splits it into two other files, m4a_aac and m4a_alac, based on the test you gave in the question.
- Generate output. With all the needed information now available, the script just has to:
  - Count the number of lines in each temporary file to get the number of files of this type,
  - Sum the sizes to get the total size of the files of this type.
Here is the script:
#!/bin/sh
# This script takes the searched directory as first parameter.
# For instance: ./this-script.sh ~/Music
: ${1:?"You must pass the search directory as first parameter."}
searchdir="$1"
# Create a temporary directory, removed automatically when the script exits
statsdir=""
trap 'rm -rf "$statsdir"' EXIT
statsdir=$(mktemp -d "/tmp/tmp.XXXXXXXXXX") || exit 1
# Generate one listing file per extension
# (the output file name is parenthesized so every awk parses it the same way)
awkscript='/\.[[:alnum:]]+$/ {print $0 > (statsdir "/" $NF)}'
# For Linux: stat -c "%s %n"
# For Mac: stat -f "%z %N"
find "$searchdir" -type f -exec stat -f "%z %N" {} + |
    awk -F '.' -v statsdir="$statsdir" "$awkscript"
# Distinguish between m4a/AAC and m4a/ALAC
if [ -f "$statsdir/m4a" ]; then
    input="$statsdir/m4a"
    while IFS= read -r line; do
        filename=${line#* }
        # Redirect stdin so avprobe cannot consume the rest of the listing
        if avprobe "$filename" 2>&1 < /dev/null | grep -q 'Audio: alac'; then
            printf '%s\n' "$line" >> "$statsdir/m4a_alac"
        else
            printf '%s\n' "$line" >> "$statsdir/m4a_aac"
        fi
    done < "$input"
    rm "$statsdir/m4a"
fi
# Generate and display result
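{
    printf "Type Count Size\n"
    # Iterate with a glob rather than parsing the output of ls
    for file in "$statsdir"/*; do
        [ -f "$file" ] || continue
        extension=${file##*/}
        # Reading from stdin makes wc print the count without the file name
        count=$(wc -l < "$file")
        # %.0f avoids exponent notation for large totals in some awk versions
        totalsize=$(awk '{s+=$1} END {printf "%.0f\n", s}' "$file")
        printf "%s %d %d\n" "$extension" "$count" "$totalsize"
    done
} | column -t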
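To see the bucketing and summing steps in isolation, the sketch below feeds three hand-written "size name" lines (made-up values, not real files) through the same awk filter, then counts and sums one of the resulting files. The demo directory and the sample paths are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: exercise the awk bucketing step on made-up input instead of real files.
demo=$(mktemp -d) || exit 1

# Three hypothetical "size name" lines, in the format stat prints in the script.
printf '%s\n' \
    '3000000 /path/to/foo.mp3' \
    '1500000 /path/to/bar.mp3' \
    '42 /path/to/notes.txt' |
awk -F '.' -v statsdir="$demo" \
    '/\.[[:alnum:]]+$/ { print $0 > (statsdir "/" $NF) }'

# Count the lines and sum the first column of the mp3 bucket.
count=$(wc -l < "$demo/mp3")
total=$(awk '{ s += $1 } END { printf "%.0f\n", s }' "$demo/mp3")
printf 'mp3: %d files, %d bytes\n' "$count" "$total"

rm -rf "$demo"
```

With this input the mp3 bucket holds two lines, so the sketch reports 2 files and 4500000 bytes; the txt line lands in a separate bucket.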
Yes, glob expansion is always sorted.
In bash (from LESS=+/'^ *Pathname Expansion' man bash):
Pathname Expansion
... the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.
This is also specified by POSIX glob:
... The pathnames are in sort order as defined by the current setting of the LC_COLLATE category.
Note 1: unless the GLOB_NOSORT flag is set, in which case the order is unspecified.
Note 2: the sort order is alphabetic (not numeric), so 10 sorts before 2.
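A quick way to observe this is to expand a pattern over a few made-up file names. The sketch below pins the collation with LC_ALL=C so the result does not depend on the caller's locale; the file names and the temporary directory are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: observe the order in which a glob expands (file names are made up).
LC_ALL=C
export LC_ALL

demo=$(mktemp -d) || exit 1
# Create the files in an order that differs from the sorted order.
touch "$demo/file2.txt" "$demo/file10.txt" "$demo/file1.txt"

# The pattern expands to a sorted list, regardless of creation order.
expanded=$(cd "$demo" && echo file*.txt)
echo "$expanded"

rm -rf "$demo"
```

In the C locale this prints file1.txt file10.txt file2.txt: 10 comes before 2 because the comparison is character by character, not numeric.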
Answers:
- Do I need to sort the file content (either with sort or additional bash code) ...
Globbing has no relation to file contents; it only works with file names.
If you need to sort the "file contents", then yes, you do need to call sort or use quite a bit more bash code.
- ... or is the glob expansion always sorted - in every environment?
Unless it is disabled with GLOB_NOSORT, the result of globbing is sorted in the order defined by the collation in effect in the environment (the LC_COLLATE variable).
To get the same sort order you must have the same collation in effect. That means both setting LC_COLLATE to the same value and having a locale definition that contains the same collation details.
- Do the conditional operators use the same "sorting" as the expansion (or after sort)?
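Yes, for the < and > operators inside [[ ... ]]: both are affected in the same way by LC_COLLATE. (The bash manual notes that the test and [ builtins use ASCII ordering for these operators instead.)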
- Would expansion or sort return file10.txt after file2.txt (in what cases?) but using conditional operators file10.txt would be before file2.txt ? What sort option would I use to fix this?
A result of 10 before 2 is "dictionary order", which is the same as what the bash manual calls "alphabetic order". So, if you use bash (or any POSIX shell) to sort, that is the order you will get (in all cases). That is not wrong, so there is nothing to fix (for text).
However, if you choose to use sort (an external tool, outside the shell), you may ask for a numeric sort (the -n option), which will place 2 before 10. Or you may extract the numbers from the text and compare them as integers (the -lt and -gt integer operators) in the shell.
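Both approaches can be sketched on made-up names. The "file" prefix and ".txt" suffix are assumptions about the naming scheme; the sort key 1.5n means "start at the fifth character of the first field and compare numerically", which only works because the prefix is exactly four characters long.

```shell
#!/bin/sh
# Sketch: two ways to get 2 before 10 (prefix "file" and suffix ".txt" are assumed).

# 1. Let sort compare numerically, starting at the 5th character of each line.
sorted=$(printf '%s\n' file10.txt file2.txt file1.txt | LC_ALL=C sort -k 1.5n)
printf '%s\n' "$sorted"

# 2. Extract the numbers and compare them as integers in the shell.
a=file10.txt
b=file2.txt
na=${a#file}; na=${na%.txt}   # strip prefix and suffix, leaving 10
nb=${b#file}; nb=${nb%.txt}   # strip prefix and suffix, leaving 2
if [ "$nb" -lt "$na" ]; then
    smaller=$b
else
    smaller=$a
fi
echo "$smaller comes first numerically"
```

The first part lists file1.txt, file2.txt, file10.txt in that order; the second reports file2.txt as the smaller of the two.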
- Are there any caveats if some of my filenames are in Unicode?
Mostly: collation order is not fixed. It changes over time and with the Unicode version.
You may get surprising results in a language you are not familiar with. For example, "aa" would match "å" in a Danish locale.
In short: » Be prepared to be surprised «.
- Are there any issues using specific versions of bash?
Well, you must use a bash version above 2.0 for LC_COLLATE to be respected.
- Does LC_COLLATE affect any of the above?
The variable LC_COLLATE affects all of the above.
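One practical consequence is that a script can pin the collation explicitly when it needs a reproducible order. A minimal sketch, assuming the external sort utility is used and that byte order is acceptable (LC_ALL takes precedence over LC_COLLATE, so it is the variable set here):

```shell
#!/bin/sh
# Sketch: sort made-up input under an explicitly chosen collation.
# LC_ALL overrides LC_COLLATE, so the caller's environment cannot change the result.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale: $c_order"
```

In the C locale all uppercase letters sort before all lowercase ones, giving A B a b. Many other locales interleave the cases instead, which is why the same script can produce a different order on another machine.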
Best Answer
Try passing the -n and -k2 command-line options to sort. That is, when I put your unsorted filenames into the file 'data.txt' and run this command:
I get this as output:
explanation of options:
You didn't ask about this, and I didn't use it above, but it may come in handy some day.
man page for sort
See my related post on SO about sorting according to different fields, "Sort by third column leaving first and second column intact (in linux)", which explains more about the sort command.
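As an illustration of what -n and -k2 do, here is a sketch on made-up two-column lines (hypothetical data, not the file names from the question). It assumes the number to sort on is the second blank-separated field.

```shell
#!/bin/sh
# Sketch with hypothetical data: sort numerically (-n) on the key starting at field 2 (-k2).
result=$(printf '%s\n' 'track 10' 'track 2' 'track 1' | LC_ALL=C sort -n -k2)
printf '%s\n' "$result"
```

The lines come out as track 1, track 2, track 10. Without -n, the second field would be compared as text and 10 would sort before 2.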