I wrote a bash script that processes a list of filenames (using glob expansion, as in for f in *) and then outputs a subset of this list to a file. Subsequently I read the content of this file into an array and perform a plain binary search for specific filenames using the obvious < and > operators for comparing strings.
Considering that I would like the script to work in as many different environments as possible, such as Linux, MacOS, MinGW, … (even though it uses things like [[ and stat which are less portable), my questions are:
- Do I need to sort the file content (either with sort or additional bash code) or is the glob expansion always sorted, in every environment?
- Do the conditional operators use the same "sorting" as the expansion (or after sort)? Would expansion or sort return file10.txt after file2.txt (in what cases?) but using conditional operators file10.txt would be before file2.txt? What sort option would I use to fix this?
- Are there any caveats if some of my filenames are in Unicode?
- Are there any issues using specific versions of bash?
- Does LC_COLLATE affect any of the above?
I obviously need the file content to match the sorting "method" of the operators in order for the binary search to work as expected…
Best Answer
Yes, glob expansion is always sorted.
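In bash (from LESS=+/'^ *Pathname Expansion' man bash), a word regarded as a pattern is replaced with an alphabetically sorted list of filenames matching the pattern. This is also specified by POSIX glob: the pathnames are returned in sort order as defined by the current setting of the LC_COLLATE category.
Note 1: unless the GLOB_NOSORT flag is set, in which case the order is unspecified.
Note 2: The sort order is alphabetic (not numeric), so 10 sorts before 2.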
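A minimal sketch of the sort order described above. The scratch directory and the three file names are made up for illustration, and LC_ALL=C is set only so the result is reproducible; under another locale the order may differ.

```shell
#!/usr/bin/env bash
# Assumption: byte-wise "C" ordering is pinned so the glob order is reproducible.
LC_ALL=C

demo_dir=$(mktemp -d)            # hypothetical scratch directory
touch "$demo_dir"/file1.txt "$demo_dir"/file2.txt "$demo_dir"/file10.txt

# The glob expands to an already sorted list; no call to sort is involved.
names=()
for f in "$demo_dir"/*; do
    names+=("${f##*/}")          # keep only the base name
done
printf '%s\n' "${names[@]}"      # file1.txt, file10.txt, file2.txt (one per line)

rm -rf "$demo_dir"
```

Here file10.txt comes before file2.txt because the comparison is made character by character, and 1 sorts before 2.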
Answers:
Globbing has no relation to the file contents; it only works with file names. If you need to sort the "file contents", then yes, you do need to call sort or use quite a bit more bash code.
Unless it is disabled with GLOB_NOSORT, the result of globbing is sorted in the order defined by the collation order (variable LC_COLLATE) in the environment.
To have the same sort order you must have the same collation in effect. That means both setting the same LC_COLLATE value and having a locale definition that contains the same collation details.
Yes. Both are affected in the same way by LC_COLLATE.
A result of 10 before 2 is "dictionary order", which is the same as what is called "alphabetic order" in the bash manual description. So, if you use bash (or any POSIX shell) to sort, that's the order you will get (in all cases). That's not wrong, so it is not fixable (for text).
However, if you choose to use sort (an external tool, outside the shell) you may ask for a numeric sort (the -n option), which will place 2 before 10. Note that -n only compares a number at the start of the sort key, so for names like file10.txt the numeric part has to be selected as the key. Or you may extract the numbers from the text and use them to make an integer comparison (the -lt and -gt integer operators) in the shell.
Mostly: Collation order is not fixed.
It changes with time and Unicode version.
What may happen is that you get some surprising results in some language that you are not familiar with. For example:
"aa" would match "å" in a Danish locale.
In short: » Be prepared to be surprised «.
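One way to reduce such surprises, assuming the script does not need language-aware ordering, is to pin the locale to C so that comparisons are made byte by byte. The sketch below uses a hypothetical helper named sorts_before; the sample strings are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: plain byte comparison ("C" locale) is acceptable for this script.
export LC_ALL=C

# Hypothetical helper: succeeds when $1 sorts strictly before $2.
sorts_before() {
    [[ "$1" < "$2" ]]
}

sorts_before "file10.txt" "file2.txt" && echo "file10.txt sorts before file2.txt"
sorts_before "Zebra" "apple"          && echo "Zebra sorts before apple (uppercase bytes come first)"
sorts_before "z" "å"                  && echo "z sorts before å (UTF-8 bytes come after ASCII)"
```

The trade-off is that the resulting order is not what a human reader of any particular language would expect, but it does not depend on which locale definitions happen to be installed.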
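Well, you must use a bash version above 2.0. Be aware that the < and > operators of [[ only follow the current locale since bash 4.1; earlier versions (including the bash 3.2 shipped with MacOS) compare with strcmp, that is, in byte order, regardless of LC_COLLATE.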
The variable LC_COLLATE affects all of the above.
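Putting the pieces together, here is a hedged sketch of the workflow from the question with the collation pinned to C, so that sort and the [[ operators agree. The helper name bsearch, the array name list, and the sample file names are all hypothetical, and names containing newlines are not handled.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # assumption: byte-wise order for glob, sort and [[ < ]]

# Hypothetical helper: binary search for $1 in the sorted global array "list".
# Prints the index and returns 0 when found, returns 1 otherwise.
bsearch() {
    local needle=$1 lo=0 hi=$(( ${#list[@]} - 1 )) mid
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if [[ "${list[mid]}" < "$needle" ]]; then
            lo=$(( mid + 1 ))
        elif [[ "${list[mid]}" > "$needle" ]]; then
            hi=$(( mid - 1 ))
        else
            printf '%s\n' "$mid"
            return 0
        fi
    done
    return 1
}

# Write a sorted list to a temporary file, then read it back one name per
# line. A while-read loop is used instead of mapfile so bash 3.2 can run it.
list_file=$(mktemp)
printf '%s\n' file2.txt file10.txt file1.txt | sort > "$list_file"
list=()
while IFS= read -r line; do
    list+=("$line")
done < "$list_file"
rm -f "$list_file"

bsearch file2.txt                        # prints 2
bsearch missing.txt || echo "not found"  # prints "not found"
```

Because the same C collation is in effect for the sort step and for the comparisons, the search invariant holds regardless of which locales are installed on the machine.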