Bash Shell – Does the Star * Wildcard Always Produce a Sorted List?

bash, shell, wildcards

I have a directory filled with files with names like logXX where XX is a two-character, zero-padded, uppercase hex number such as:

log00
log01
log02
...
log0A
log0B
log0C
...
log4E
log4F
log50
...

Generally there will be fewer than, say, 20 or 30 files in total. The date and time on my particular system cannot be relied upon (it is an embedded system with no reliable NTP or GPS time source). However, the filenames will reliably increment as shown above.

I wish to grep through all the files for the single most recent log entry of a certain type. I was hoping to cat the files together, such as:

cat /tmp/logs/log* | grep 'WARNING 07 -' | tail -n1

However it occurred to me that different versions of bash or sh or zsh etc. might have different ideas about how the * is expanded.

The man bash page doesn't say whether or not the expansion of * would be a definitely ascending alphabetical list of matching filenames. It does seem to be ascending every time I've tried it on all the systems I have available to me — but is it DEFINED behaviour or just implementation specific?

In other words can I absolutely rely on cat /tmp/logs/log* to concatenate all my log files together in alphabetical order?

Best Answer

In all shells, globs are sorted by default. They were already sorted by the /etc/glob helper that Ken Thompson's shell called to expand globs in the first version of Unix in the early 70s (and which gave globs their name).

For sh, POSIX does require them to be sorted by way of strcoll(), that is, using the sorting order in the user's locale, as for ls, though some shells still do it via strcmp(), that is, based on byte values only.

$ dash -c 'echo *'
Log01B log-0D log00 log01 log02 log0A log0B log0C log4E log4F log50 log① log② lóg01
$ bash -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0A log0B log0C log-0D log4E log4F log50
$ zsh -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0A log0B log0C log-0D log4E log4F log50
$ ls
log②  log①  log00  log01  lóg01  Log01B  log02  log0A  log0B  log0C  log-0D  log4E  log4F  log50
$ ls | sort
log②
log①
log00
log01
lóg01
Log01B
log02
log0A
log0B
log0C
log-0D
log4E
log4F
log50

You may notice above that for those shells that sort based on the locale, here on a GNU system with an en_GB.UTF-8 locale, the - in the file names is ignored for sorting (as most punctuation characters would be). The ó is sorted in a more expected way (at least to British people), and case is ignored (except when it comes to deciding ties).

However, you'll notice some inconsistencies for log① log②. That's because the sorting order of ① and ② is not defined in GNU locales (currently; hopefully it will be fixed some day). They sort the same, so you get random results.

Changing the locale will affect the sorting order. You can set the locale to C to get a strcmp()-like sort:

$ bash -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0.2 log0A log0B log0C log-0D log4E log4F log50
$ bash -c 'LC_ALL=C; echo *'
Log01B log-0D log0.2 log00 log01 log02 log0A log0B log0C log4E log4F log50 log① log② lóg01
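
Applied to the command from the question, this could look like the sketch below. It assumes that byte-value order is what is wanted, which holds for zero-padded uppercase hex names because the digits 0-9 come before the letters A-F in ASCII. The function name last_match and the demo files are made up for illustration. The locale is set inside a subshell, on its own line, because a prefix such as LC_ALL=C cat log* only changes the environment of cat: the shell has already expanded the glob in its own locale by then.

```shell
#!/bin/sh
# Hypothetical helper: print the last line matching a pattern across the
# log* files in a directory, reading the files in byte-value (C locale) order.
last_match() {
    dir=$1
    pattern=$2
    (
        # Set the locale inside a subshell so the glob below is sorted by
        # byte value without changing the caller's locale.
        LC_ALL=C
        export LC_ALL
        cat "$dir"/log* | grep -e "$pattern" | tail -n 1
    )
}

# Demo with throwaway files (names and contents are made up).
demo_dir=$(mktemp -d)
printf '%s\n' 'WARNING 07 - first'  > "$demo_dir/log09"
printf '%s\n' 'WARNING 07 - second' > "$demo_dir/log0A"
printf '%s\n' 'WARNING 07 - third'  > "$demo_dir/log10"
result=$(last_match "$demo_dir" 'WARNING 07 -')
echo "$result"
rm -rf "$demo_dir"
```

On the demo files the glob expands to log09, log0A, log10 in that order, so the entry from log10 is the one printed.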

Note that some locales can cause some confusion even for all-ASCII, all-alphanumeric strings. Czech ones, for example (on GNU systems at least), where ch is a collating element that sorts after h:

$ LC_ALL=cs_CZ.UTF-8 bash -c 'echo *'
log0Ah log0Bh log0Dh log0Ch

Or, as pointed out by @ninjalj, even weirder ones in Hungarian locales:

$ LC_ALL=hu_HU.UTF-8 bash -c 'echo *'
logX LOGx LOGX logZ LOGz LOGZ logY LOGY LOGy
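
If you are unsure whether your current locale reorders a given set of names, one way to check is to feed the glob's output to sort -c under the C locale. This is only a sketch: the function name glob_is_byte_sorted and the demo files are hypothetical, and it assumes that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the files in directory $1, in the order the
# shell's glob returns them, are already in byte-value (C locale) order.
# Assumes no file name contains a newline.
glob_is_byte_sorted() {
    # The glob is expanded by this shell in its current locale; sort -c only
    # checks the order it is given and reports failure if it is not sorted.
    printf '%s\n' "$1"/* | LC_ALL=C sort -c 2>/dev/null
}

# Demo with throwaway files (names are made up).
check_dir=$(mktemp -d)
: > "$check_dir/log0Ah"
: > "$check_dir/log0Ch"
: > "$check_dir/log0Dh"
if glob_is_byte_sorted "$check_dir"; then
    glob_order=byte-sorted
else
    glob_order=locale-sorted
fi
echo "$glob_order"
rm -rf "$check_dir"
```

In a shell running under a Czech locale like the one above, this would be expected to report locale-sorted, since log0Ch is placed after log0Dh there.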

In zsh, you can choose the sorting with glob qualifiers. For instance:

echo *(om) # to sort by modification time
echo *(oL) # to sort by size
echo *(On) # for a *reverse* sort by name
echo *(o+myfunction) # sort using a user-defined function
echo *(oN) # to NOT sort
echo *(n)  # sort by name, but numerically, and so on.

The numeric sort of echo *(n) can also be enabled globally with the numericglobsort option:

$ zsh -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0.2 log0A log0B log0C log-0D log4E log4F log50
$ zsh -o numericglobsort -c 'echo *'
log① log② log00 lóg01 Log01B log0.2 log0A log0B log0C log01 log02 log-0D log4E log4F log50

If you (as I was) are confused by that order in that particular instance (here using my British locale), see here for details.
