Bash – the sort order when using conditional operators

bash shell-script sort wildcards

I wrote a bash script that processes a list of filenames (using glob expansion, as in for f in *) and then outputs a subset of this list to a file. Subsequently I read the content of this file into an array and perform a plain binary search for specific filenames using the obvious < and > operators for comparing strings.

Considering that I would like the script to work in as many different environments as possible, such as Linux, macOS, MinGW, … (even though it uses things like [[ and stat, which are less portable), my questions are:

  1. Do I need to sort the file content (either with sort or additional bash code) or is the glob expansion always sorted – in every environment?
  2. Do the conditional operators use the same "sorting" as the expansion (or after sort)?

    Would expansion or sort return file10.txt after file2.txt (in what cases?) but using conditional operators file10.txt would be before file2.txt ? What sort option would I use to fix this?

  3. Are there any caveats if some of my filenames are in Unicode?

  4. Are there any issues using specific versions of bash?
  5. Does LC_COLLATE affect any of the above?

I obviously need the file content to match the sorting "method" of the operators in order for the binary search to work as expected…

Best Answer

Yes, glob expansion is always sorted.
In bash (from LESS=+/'^ *Pathname Expansion' man bash)

Pathname Expansion ... the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.

This is also specified by POSIX glob:

... The pathnames are in sort order as defined by the current setting of the LC_COLLATE category.

Note 1: This holds unless the GLOB_NOSORT flag (of the C glob() function) is set, in which case the order is unspecified.

Note 2: The sort order is alphabetic (not numeric), so 10 sorts before 2.
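As a small illustration of the two notes above, the following sketch expands a glob in a scratch directory with the collation pinned to the C locale. The function name demo_glob_order and the three file names are made up for the example, and pinning LC_ALL=C is an assumption made only to keep the result reproducible.

```bash
#!/usr/bin/env bash

# demo_glob_order: list three sample files via glob expansion (hypothetical helper).
demo_glob_order() {
    local dir
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch file1.txt file2.txt file10.txt
        export LC_ALL=C        # assumption: pin collation so the order is reproducible
        printf '%s\n' *        # the glob result is already sorted by the shell
    )
    rm -rf "$dir"
}

demo_glob_order
# prints, in the C locale:
#   file1.txt
#   file10.txt
#   file2.txt
```

The 10 lands before the 2 because the names are compared character by character, and "1" sorts before "2".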


Answers:

  1. Do I need to sort the file content (either with sort or additional bash code) ...

Globbing has no relation to file contents; it only works with file names.
If you need to sort the "file contents", then yes, you do need to call sort or use quite a bit more bash code.

  1. ... or is the glob expansion always sorted - in every environment?

Unless it is disabled with GLOB_NOSORT, the result of globbing is sorted in the collation order of the environment (the LC_COLLATE variable, which LC_ALL overrides when set).

To get the same sort order you must have the same collation in effect. That means both the same LC_COLLATE setting and a locale definition that contains the same collation details.

  2. Do the conditional operators use the same "sorting" as the expansion (or after sort)?

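Yes. Both are affected in the same way by LC_COLLATE, at least in bash 4.1 and later, where the < and > operators of [[ use the current locale's collation (strcoll). Earlier versions, such as the bash 3.2 shipped with macOS, compare the strings with strcmp (ASCII order) instead.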
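One way to keep the list and the comparisons in agreement is to pin the collation once at the top of the script. The sketch below assumes LC_ALL=C is acceptable for the script (in the C locale, collation is plain byte order); the function name bsearch and the sample names are hypothetical, and this is not necessarily how the asker's script is written.

```bash
#!/usr/bin/env bash

# Assumption: byte order is good enough, so pin it for globbing, sort and [[ < ]].
export LC_ALL=C

# bsearch NEEDLE ITEM...: succeed if NEEDLE is among the (already sorted) ITEMs.
bsearch() {
    local needle=$1; shift
    local -a items=("$@")
    local lo=0 hi=$(( ${#items[@]} - 1 )) mid
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if [[ ${items[mid]} == "$needle" ]]; then
            return 0                       # exact match (quoted, so no pattern matching)
        elif [[ ${items[mid]} < "$needle" ]]; then
            lo=$(( mid + 1 ))              # needle sorts after the middle element
        else
            hi=$(( mid - 1 ))              # needle sorts before the middle element
        fi
    done
    return 1
}

names=(file1.txt file10.txt file2.txt)     # sample names, already in C-locale order
bsearch file10.txt "${names[@]}" && echo "found"
```

The search only works if the array was sorted with the same collation that [[ < ]] uses, which is why the locale is set before anything else happens.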

    Would expansion or sort return file10.txt after file2.txt (in what cases?) but using conditional operators file10.txt would be before file2.txt ? What sort option would I use to fix this?

A result of 10 before 2 is "dictionary order" which is the same as what is called "alphabetic order" in the bash manual description. So, if you use bash (or any POSIX shell) to sort, that's the order you will get (in all cases). That's not wrong, so it is not fixable (for text).

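However, if you choose to use sort (an external tool, outside the shell) you may ask for a numeric sort (the -n option), which will place 2 before 10. Note that -n reads the number at the start of the sort key, so for names such as file2.txt the key has to begin at the digits (for example -k1.5n). Or you may extract the numbers from the text and use them to make an integer comparison (the -lt and -gt integer operators) in the shell.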
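Both options can be sketched as follows. The helpers first_number and numeric_less are hypothetical names introduced for this example, and the sort key -k1.5n assumes names of the form fileN.txt, where the digits start at the fifth character.

```bash
#!/usr/bin/env bash

# first_number NAME: print the first run of digits in NAME as a decimal integer.
first_number() {
    local re='[0-9]+'
    [[ $1 =~ $re ]] || return 1
    # 10# forces base 10, so a name like file08.txt is not read as octal.
    printf '%d\n' "$(( 10#${BASH_REMATCH[0]} ))"
}

# numeric_less A B: succeed if the number in A is smaller than the number in B.
numeric_less() {
    local a b
    a=$(first_number "$1") || return 2
    b=$(first_number "$2") || return 2
    (( a < b ))
}

numeric_less file2.txt file10.txt && echo "file2.txt comes first numerically"

# External alternative: start the numeric key at character 5 of field 1,
# which is where the digits begin in names of the form fileN.txt.
printf '%s\n' file10.txt file2.txt file1.txt | LC_ALL=C sort -k1.5n
# prints file1.txt, file2.txt, file10.txt (one per line)
```

A list sorted this way no longer matches the order used by < and > in [[, so a binary search over it would have to compare with numeric_less (or a similar helper) rather than with the string operators.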

Are there any caveats if some of my filenames are in Unicode?

Mostly: Collation order is not fixed.

It changes over time and between Unicode versions.

What may happen is that you get some surprising results in a language that you are not familiar with. For example:

"aa" would match "å" in a Danish locale.

In short: » Be prepared to be surprised «.

Are there any issues using specific versions of bash?

Well, you must use a bash version above 2.0, the version from which bash respects LC_COLLATE.

Does LC_COLLATE affect any of the above?

The variable LC_COLLATE affects all of the above.
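Because the locale influences every step, a cheap guard is to verify, before searching, that the saved list really is in the order the comparison operators expect. A minimal sketch, assuming one file name per line (so no newlines inside names); is_sorted_file and the scratch file are hypothetical:

```bash
#!/usr/bin/env bash

# is_sorted_file FILE: succeed if the lines of FILE are in strictly ascending
# order according to the same [[ < ]] comparison the search will use.
is_sorted_file() {
    local prev='' line first=1
    while IFS= read -r line || [[ -n $line ]]; do
        if (( ! first )) && [[ ! $prev < "$line" ]]; then
            return 1                   # out of order (or a duplicate line)
        fi
        prev=$line
        first=0
    done < "$1"
    return 0
}

export LC_ALL=C                        # assumption: pin collation for the demo
list=$(mktemp) || exit 1
printf '%s\n' file1.txt file10.txt file2.txt > "$list"
is_sorted_file "$list" && echo "list matches the [[ < ]] order"
rm -f "$list"
```

If the check fails, the list was probably produced under a different collation than the one in effect now, and it should be re-sorted before the search runs.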
