PostgreSQL Collation – Different Sort Despite Same Collation

collationpostgresql

I am getting different sort results for items with diacritics even though the collation being used is apparently the same.

So, after installing Postgress.app in my mac box for development I create the following table:

create table names (
    id int primary key,
    name varchar
)

and then insert a few values

insert into names values (1, "Francisco");
insert into names values (2, "Ana");
insert into names values (3, "Ágata");

and then query them like so:

select name from names order by name;

I get a wrong sorting order like so:

Ana
Francisco
Ágata

I can confirm with

select datname, 
       datcollate
from pg_database;

that the datcollate value for all databases is en_US.UTF-8.

To help me debug this problem, I quickly spinned up an RDS instance, with same data, query and collation, but this time I get the proper expected sort order of:

Ágata
Ana
Francisco

What am I missing here, and how can I configuration postgres in my dev box to behave as expected?

Best Answer

PostgreSQL uses the operating system for its locales (at least with libc as the collation provider, otherwise icu may be used), and unfortunately, different operating systems give inconsistent sort results for the same UTF-8 locale names.

The fact that UTF-8 collations with Postgres.app doesn't sort as expected is an old open item on their issue tracker: Problems with sort order (UTF8 locales don't work). It can't really be fixed unless Apple fixes macOS, and so far it's not happening. See Ordering differences between Postgres instances on different machines (same locale) for a 2012 question on SO mentioning the problem.

What am I missing here, and how can I configuration postgres in my dev box to behave as expected?

Your options:

  • use the same operating system for dev and production, possibly virtualized.

  • use ICU collations. They sort the same across all operating systems, but as of PG version 12, they can be used only with explicit COLLATE clauses at the column or expression level (not at the database level).

  • use the "C" or "C.UTF-8" collations, which sort binary-wise by codepoints. They don't provide a linguistically-aware sort order, but they're faster, immutable and consistent across operating systems.