Linux – Why does GNU sort sort differently on an OSX machine and a Linux machine

gnu linux locale osx sort

I have an OSX machine where sort runs GNU sort from coreutils 8.26 (installed from Homebrew), and a Linux machine where sort runs GNU sort from coreutils 8.25.

On the Mac:

mac$ echo -e "{1\n2" | sort
2
{1

While on Linux:

linux$ echo -e "{1\n2" | sort
{1
2

I'm aware that sort depends on the locale. I ran locale on the Linux machine, prepended each line of output with export and ran the resulting lines on the OSX machine before running (in the same terminal) the sort command again, which gave the same output as before.

I noticed, however, that running locale on the Mac doesn't show all of the lines which appear on Linux, and I'm not sure if this is related.

The locale on Linux:

linux$ locale
LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=en_CA.UTF-8

And locale on OSX:

mac$ locale
LANG="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_CTYPE="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_ALL="en_CA.UTF-8"

I've found that if I set LC_ALL=C on both machines, they both sort 2 before {1. But if I set LC_ALL=en_CA.UTF-8 on both machines, I get the differing output shown above. The same happens if I set LC_ALL=en_CA.utf8 on both machines. (locale -a lists en_CA.utf8 on the Linux machine but en_CA.UTF-8 on the OSX machine.)

Any idea what is going on here?

Best Answer

I did some digging on the same problem the other day, so let me share a technical answer.


On macOS, /usr/share/locale/en_US.UTF-8/LC_COLLATE (or en_CA.UTF-8, same thing) is a symlink to /usr/share/locale/la_LN.US-ASCII/LC_COLLATE, which is generated from la_LN.US-ASCII.src with colldef. Here's the entirety of la_LN.US-ASCII.src:

# ASCII
#
# $FreeBSD: src/share/colldef/la_LN.US-ASCII.src,v 1.2 1999/08/28 00:59:47 peter Exp $
#
order \
    \x00;...;\xff

You can verify that the binary LC_COLLATE file is indeed generated from la_LN.US-ASCII.src by comparing checksums:

$ colldef -o /dev/stdout usr-share-locale.tproj/colldef/la_LN.US-ASCII.src | sha256sum
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78  -

$ sha256sum </usr/share/locale/en_US.UTF-8/LC_COLLATE
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78  -

The ruleset is easy to understand: strings are compared byte by byte, by byte value. So on macOS the collation rules for en_US.UTF-8 are the same as those of the POSIX locale (aka the C locale). { is 0x7B and 2 is 0x32, so {1 sorts after 2.
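As a rough illustration of that byte-wise comparison, here is a minimal sketch, assuming a POSIX-style shell with printf and sort available. The helper name byte_value is made up for this example. Forcing LC_ALL=C should give the same order on both machines, as the question already observed.

```shell
# byte_value is a hypothetical helper: print the numeric value of a
# single-byte character.
byte_value() {
    # A leading single quote makes printf output the character's code.
    printf '%d\n' "'$1"
}

byte_value '{'    # 123 (0x7B)
byte_value '2'    # 50  (0x32)

# In the C locale, sort compares raw bytes, so 2 comes before {1.
printf '%s\n' '{1' '2' | LC_ALL=C sort
```

Pinning LC_ALL=C on the sort invocation itself is probably the simplest way to get identical results across the two systems when a locale-aware order is not needed.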

This ruleset is an artifact of FreeBSD 5, synced into Mac OS X 10.3 Panther. See the colldef directory in the FreeBSD 5.0.0 source tree. It has not changed on OS X / macOS since.


On Linux, locale programs and data are part of glibc. See the glibc localedata/locales tree, or /usr/share/i18n/locales on Debian/Ubuntu. If you inspect /usr/share/i18n/locales/en_US, you'll see that it pulls in iso14651_t1_common for its LC_COLLATE rules, so collation follows ISO 14651 rather than raw byte values. Under those rules punctuation such as { does not sort by its byte value, which is why {1 comes before 2 on Linux.
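To check this on a given Linux machine, something like the following sketch may help. It assumes the Debian/Ubuntu path mentioned above and that the locale source files are installed. The helper name collation_source is hypothetical, and the exact table name in the file may differ between glibc versions.

```shell
# collation_source is a hypothetical helper: list the lines of a glibc
# locale definition file that mention an ISO 14651 table.
collation_source() {
    grep -n 'iso14651' "$1" 2>/dev/null \
        || echo "no ISO 14651 reference found in $1"
}

# Path as on Debian/Ubuntu; adjust for other distributions.
collation_source /usr/share/i18n/locales/en_US
```

If the file is missing or contains no such line, the helper prints a fallback message instead of failing.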


There are more details in the blog post: https://blog.zhimingwang.org/macos-lc_collate-hunt.
