Linux – Why does Gnu sort sort differently on the OSX machine and Linux machine


I have an OSX machine where sort runs GNU sort from coreutils 8.26 (installed from Homebrew), and a Linux machine where sort runs GNU sort from coreutils 8.25.

On the Mac:

mac$ echo -e "{1\n2" | sort
2
{1

While on Linux:

linux$ echo -e "{1\n2" | sort
{1
2

I'm aware that sort depends on the locale. I ran locale on the Linux machine, prepended each line of output with export and ran the resulting lines on the OSX machine before running (in the same terminal) the sort command again, which gave the same output as before.

I noticed, however, that running locale on the Mac doesn't show all of the lines which appear on Linux, and I'm not sure if this is related.

The locale on Linux:

linux$ locale
LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=en_CA.UTF-8

And locale on OSX:

mac$ locale
LANG="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_CTYPE="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_ALL="en_CA.UTF-8"

I've found that if I set LC_ALL=C on both machines, they both sort 2 before {1. But if I set LC_ALL=en_CA.UTF-8 on both machines I have the differing output as above. Same if I set LC_ALL=en_CA.utf8 on both machines. (locale -a lists en_CA.utf8 on the Linux machine but en_CA.UTF-8 on the OSX machine.)

Any idea what is going on here?

Best Answer

I did some digging on the same problem the other day, so let me share a technical answer.

On macOS, /usr/share/locale/en_US.UTF-8/LC_COLLATE (or en_CA.UTF-8, same thing) is a symlink to /usr/share/locale/la_LN.US-ASCII/LC_COLLATE, which is generated from la_LN.US-ASCII.src with colldef. Here's the entirety of la_LN.US-ASCII.src:

# ASCII
#
# $FreeBSD: src/share/colldef/la_LN.US-ASCII.src,v 1.2 1999/08/28 00:59:47 peter Exp $
#
order \
    \x00;...;\xff

You can verify that the binary LC_COLLATE file is indeed generated from la_LN.US-ASCII.src by verifying checksums:

$ colldef -o /dev/stdout usr-share-locale.tproj/colldef/la_LN.US-ASCII.src | sha256sum
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78  -

$ sha256sum </usr/share/locale/en_US.UTF-8/LC_COLLATE
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78  -

The ruleset is easily understandable: just compare the byte values one by one. So the collation rules for en_US.UTF-8 are the same as the POSIX locale (aka C locale). { is 0x7B, 2 is 0x32, so { comes after 2.
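As a quick check of the byte-value reasoning above, the following sketch prints the two byte values and then sorts the same input under LC_ALL=C, which is available on both systems. The variable name sorted is only for illustration.

```shell
# Byte values of the first character on each line: "{" is 0x7b, "2" is 0x32
printf '0x%x\n' "'{" "'2"

# In the C locale, sort compares those bytes directly, so "2" sorts first
sorted=$(printf '{1\n2\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

This should print 2 before {1, matching what the macOS en_CA.UTF-8 collation table produces.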

This ruleset is an artifact of FreeBSD 5, synced into Mac OS X 10.3 Panther. See colldef directory in FreeBSD 5.0.0 source tree. It never changed on OS X / macOS since.

On Linux, locale programs and data are part of glibc. See glibc localedata/locales tree, or /usr/share/i18n/locales on Debian/Ubuntu. If you inspect /usr/share/i18n/locales/en_US, you'll see that it pulls in iso14651_t1_common for LC_COLLATE rules. So it follows ISO 14651 rules for collation.

There are more details in the blog post: https://blog.zhimingwang.org/macos-lc_collate-hunt.

Locale names

On all current unix variants that I know of (but not on a few antiques), locale names follow the same pattern:

An ISO 639-1 lowercase two-letter language code, or an ISO 639-2 three-letter language code if the language has no two-letter code. For example, en for English, de for German, ja for Japanese, uk for Ukrainian, ber for Berber, …
For many but not all languages, an underscore _ followed by an ISO 3166 uppercase two-letter country code. Thus: en_US for US English, en_GB for British English, fr_CA for Canadian (Québec) French, de_DE for German of Germany, de_AT for German of Austria, ja_JP for Japanese (of Japan), etc.
Optionally, a dot . followed by the name of a character encoding such as UTF-8, ISO-8859-1, KOI8-U, GB2312, Big5, etc. With GNU libc at least (I don't know how widespread this is), case and punctuation are ignored in encoding names. For example, zh_CN.UTF-8 is Mandarin (simplified) Chinese encoded in UTF-8, while zh_CN is Mandarin Chinese encoded in GB2312, and zh_TW is Taiwanese (traditional) Chinese encoded in Big5.
Optionally, an at sign @ followed by the name of a variant. The meaning of variants is locale-dependent. For example, many European countries have an @euro locale variant where the currency sign is € and where the encoding is one that includes this character (ISO 8859-15 or ISO 8859-16), as opposed to the unadorned variant with the older currency sign. For example, en_IE (English, Ireland) uses the latin1 (ISO 8859-1) encoding and £ as the currency symbol while en_IE@euro uses the latin9 (ISO 8859-15) encoding and € as the currency symbol.

In addition, there are two locale names that exist on all unix-like systems: C and POSIX. These names are synonymous and mean computerese, i.e. default settings that are appropriate for data that is parsed by a computer program.
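To make the naming pattern concrete, here is a minimal sketch that splits a locale name into its parts using only POSIX parameter expansion. The function name parse_locale and its output format are made up for this illustration; it does not check that the locale actually exists on the system.

```shell
# Hypothetical helper: split a name of the form
# language[_TERRITORY][.codeset][@modifier] into its four parts.
parse_locale() {
    name=$1
    modifier= codeset= territory=
    # Peel the optional parts off from the right-hand side
    case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.}; name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$name" "$territory" "$codeset" "$modifier"
}

parse_locale en_IE.ISO-8859-15@euro
parse_locale C
```

The first call prints language=en territory=IE codeset=ISO-8859-15 modifier=euro; for C, only the language field is filled in.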

Locale settings

The following locale categories are defined by POSIX:

LC_CTYPE: the character set used by terminal applications: classification data (which characters are letters, punctuation, spaces, invalid, etc.) and case conversion. Text utilities typically heed LC_CTYPE to determine character boundaries.
LC_COLLATE: collation (i.e. sorting) order. This setting is of very limited use for several reasons:
- Most languages have intricate rules that depend on what is being sorted (e.g. dictionary words and proper names might not use the same order) and cannot be expressed by LC_COLLATE.
- There are few applications where proper sort order matters which are performed by software that uses locale settings. For example, word processors store the language and encoding of a file in the file itself (otherwise the file wouldn't be processed correctly on a system with different locale settings) and don't care about the locale settings specified by the environment.
- LC_COLLATE can have nasty side effects, in particular because it causes the sort order A < a < B < …, which makes “between A and Z” include the lowercase letters a through y. In particular, very common regular expressions like [A-Z] break some applications.
LC_MESSAGES: the language of informational and error messages.
LC_NUMERIC: number formatting: decimal and thousands separator.
Many applications hard-code . as a decimal separator. This makes LC_NUMERIC not very useful and potentially dangerous:
- Even if you set it, you'll still see the default format pretty often.
- You're likely to get into a situation where one application produces locale-dependent output and another application expects . to be the decimal point, or , to be a field separator.
LC_MONETARY: like LC_NUMERIC, but for amounts of local currency.
Very few applications use this.
LC_TIME: date and time formatting: weekday and month names, 12 or 24-hour clock, order of date parts, punctuation, etc.

GNU libc, which you'll find on non-embedded Linux, defines additional locale categories:

LC_PAPER: the default paper size (defined by height and width).
LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, LC_IDENTIFICATION: I don't know of any application that uses these.

Environment variables

Applications that use locale settings determine them from environment variables.

The value of the LANG environment variable is used unless overridden by another setting. If LANG is not set, the default locale is C.
The LC_xxx names can be used as environment variables.
If LC_ALL is set, then all other values are ignored; this is primarily useful for setting LC_ALL=C when running applications that need to produce the same output regardless of where they are run.
In addition, GNU libc uses LANGUAGE to define fallbacks for LC_MESSAGES (e.g. LANGUAGE=fr_BE:fr_FR:en to prefer Belgian French, or if unavailable France French, or if unavailable English).
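The precedence described above (LC_ALL, then the specific LC_xxx variable, then LANG, then C) can be sketched as a small shell function. The name effective_locale is hypothetical and the function only inspects the variables; it does not ask the C library what it actually loaded, and it ignores LANGUAGE.

```shell
# Hypothetical helper: report which value wins for one category,
# following the precedence LC_ALL > LC_<category> > LANG > "C".
effective_locale() {
    category=$1
    # Indirect lookup of the variable whose name is in $category
    eval "specific=\${$category-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$specific" ]; then
        printf '%s\n' "$specific"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

(
    unset LC_ALL LC_TIME
    LANG=en_CA.UTF-8 LC_COLLATE=C
    effective_locale LC_COLLATE   # prints C
    effective_locale LC_TIME      # prints en_CA.UTF-8
)
```

The subshell keeps the example assignments from leaking into the current session. Some shells may print a warning on stderr if the named locale is not installed.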

Installing locales

Locale data can be large, so some distributions don't ship them in a usable form and instead require an additional installation step.

On Debian, to install locales, run dpkg-reconfigure locales and select from the list in the dialog box, or edit /etc/locale.gen and then run locale-gen.
On Ubuntu, to install locales, run locale-gen with the names of the locales as arguments.

You can define your own locale.

Recommendation

The useful settings are:

Set LC_CTYPE to the language and encoding that you encode your text files in. Ensure that your terminals use that encoding.
For most languages, only the encoding matters. There are a few exceptions; for example, an uppercase i is I in most languages but İ in Turkish (tr_TR).
Set LC_MESSAGES to the language that you want to see messages in.
Set LC_PAPER to en_US if you want US Letter to be the default paper size and just about anything else (e.g. en_GB) if you want A4.
Optionally, set LC_TIME to your favorite time format.

As explained above, avoid setting LC_COLLATE and LC_NUMERIC. If you use LANG, explicitly override these two categories by setting them to C.
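A minimal sketch of that last recommendation, run in a subshell so it does not alter the current session. The locale name en_CA.UTF-8 is an assumption taken from the question; substitute one that locale -a lists on your system.

```shell
pinned=$(
    unset LC_ALL                      # LC_ALL would override everything below
    LANG=en_CA.UTF-8                  # assumed name; use one listed by `locale -a`
    LC_COLLATE=C LC_NUMERIC=C         # the two categories to keep in the C locale
    export LANG LC_COLLATE LC_NUMERIC
    printf '{1\n2\n' | sort
)
printf '%s\n' "$pinned"
```

With LC_COLLATE pinned to C, sort should print 2 before {1 regardless of what LANG selects. In a shell startup file the same assignments would appear without the surrounding subshell.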

Why GNU find -execdir command behave differently than BSD find

It's not an endless loop; it's just GNU find reporting that echo died of a SIGPIPE (because the other end of the pipe on stdout was closed when head died).

-execdir is not specified by POSIX. And even for -exec, there's nothing in the POSIX spec that says that if the command is killed by a SIGPIPE, find should exit.

So, if POSIX did specify -execdir, gfind would probably be more POSIX conformant than your BSD find. That assumes your BSD find exits when its child dies of a SIGPIPE, as the wording of your question suggests; FreeBSD find doesn't in my tests and does run echo for every file (as GNU find does, so not endlessly).

You may say that for most common cases, find exiting upon a child dying of SIGPIPE would be preferable, but the -executed command could still die of a SIGPIPE for other reasons than the pipe on stdout being closed, so exiting find for that would be borderline acceptable.

With GNU find, you can tell find to quit if a command fails with:

find . ... \( -exec echo {} \; -o -quit \)
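To see the effect of that expression, here is a small sketch on a scratch directory. It assumes a find that supports -quit (GNU find, as stated above); the directory, the file names and the always-failing command are made up for the illustration.

```shell
tmp=$(mktemp -d)
touch "$tmp/a" "$tmp/b" "$tmp/c"

# The command prints the file name and then fails, so the -o branch
# runs -quit right after the first file instead of visiting all three.
visited=$(find "$tmp" -type f \
    \( -exec sh -c 'echo "$1"; exit 1' sh {} \; -o -quit \) | wc -l)
echo "files visited before quitting: $((visited))"

rm -rf "$tmp"
```

Without the -o -quit part, the command would be run once for each of the three files.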

As to whether a find implementation is allowed or forbidden to report children dying of a signal on stderr, here (with the usage of -execdir) we're outside the scope of POSIX anyway, but if -exec was used in place of -execdir, it seems that would be a case where gfind is not conformant.

The spec for find says: "the standard error shall be used only for diagnostic messages" but also says there:

Default Behavior: When this section is listed as "The standard error shall be used only for diagnostic messages.", it means that, unless otherwise stated, the diagnostic messages shall be sent to the standard error only when the exit status indicates that an error occurred and the utility is used as described by this volume of POSIX.1-2008.

Which would indicate that since find doesn't return with a non-zero exit status in that case, it should not output that message on stderr.

Note that by that text, both GNU and FreeBSD find would be non-compliant in a case like:

$ find /dev/null -exec blah \;; echo "$?"
find: `blah': No such file or directory
0

where both report an error without setting the exit status to non-zero. Which is why I raised the question on the austin-group (the guys behind POSIX) mailing list.

Note that if you change your command to:

(trap '' PIPE; find -L /etc -execdir echo {} \; | head)

echo will still be run for every file, will still fail, but this time, it will be echo reporting the error message.

Now about filename vs /etc/filename vs ./filename being displayed.

Again, -execdir being not a standard option, there's no text that says who's right and who's wrong. -execdir was introduced by BSD find and copied later by GNU find.

GNU find has done some intentional changes (improvements) over it. For instance, it prepends file names with ./ in the arguments passed to commands. That means that find . -execdir cmd {} \; doesn't have a problem with filenames starting with - for instance.

The fact that -L -execdir doesn't pass a filepath relative to the parent directory is actually a bug that affects version 4.3.0 to 4.5.8 of GNU find. It was fixed in 4.5.9, but that's on the development branch and there hasn't been a new stable release since (as of 2015-12-22, though one is imminent).

More info at the findutils mailing list.

If all you want is to print the base name of every file in /etc portably, you can just do:

find -L /etc -exec basename {} \;

Or more efficiently:

find -L ///etc | awk -F / '/\/\// && NR>1 {print last}
                          {if (NF > 1) last = $NF
                           else last = last "\n" $NF}
                          END {if (NR) print last}'

which you can simplify to

find -L /etc | awk -F / '{print $NF}'

if you can guarantee that file paths don't contain newline characters (IIRC, some versions of OS X had such files in /etc though).
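As a sanity check of the simplified pipeline, the sketch below runs it on a scratch tree instead of /etc. The directory layout and file names are invented for the example, and none of them contains a newline, which is the condition stated above.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/root/sub"
touch "$tmp/root/one" "$tmp/root/sub/two"

# Last /-separated field of every path; sorted only to get a stable order
names=$(find -L "$tmp/root" | awk -F / '{print $NF}' | LC_ALL=C sort)
printf '%s\n' "$names"

rm -rf "$tmp"
```

This lists one, root, sub and two: the starting directory itself is included, just as /etc would be.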

GNUly:

find -L /etc -printf '%f\n'

As to whether:

find -exec echo {} \;

in the link you're referring to, is POSIX or not.

No, as a command invocation, that is not POSIX. A script containing it would be non-compliant.

POSIX find requires that at least one path be given, but leaves the behaviour unspecified if the first non-option argument of find starts with - or is a find predicate (like ! or (). So GNU find's behaviour is compliant, and so are implementations that report an error (or treat the first argument as a file path even if it represents a find predicate) or spray red paint at your face. There's no reason POSIXLY_CORRECT would affect find's behaviour there.