Why Postgres ORDER BY Ignores Leading Underscores

natural sortorder-bypostgresqlpostgresql-9.1

I have an animal table with a name varchar(255), and I've added rows with the following values:

Piranha
__Starts With 2
Rhino
Starts With 1
0_Zebra
_Starts With 1
Antelope
_Starts With 1

When I run this query:

zoology=# SELECT name FROM animal ORDER BY name;
      name       
-----------------
0_Zebra
Antelope
Piranha
Rhino
_Starts With 1
_Starts With 1
Starts With 1
__Starts With 2
(8 rows)

Notice how the rows are sorted in an order that implies the leading _ is used to place the _Starts With 1 rows before the Starts row, but the __ in the __Starts With 2 seems to ignore this fact, as if the 2 at the end is more important than the first two characters.

Why is this?

If I sort with Python, the result is:

In  [2]: for animal in sorted(animals):
   ....:     print animal
   ....:     
0_Zebra
Antelope
Piranha
Rhino
Starts With 1
_Starts With 1
_Starts With 1
__Starts With 2

Furthermore, Python ordering suggests that underscores come after letters, which indicates that the Postgres's sorting of the first two _Starts rows before the Starts row is incorrect.

Note: I'm using Postgres 9.1.15

Here are my attempts at finding the collation:

zoology=# select datname, datcollate from pg_database;
  datname  | datcollate  
-----------+-------------
 template0 | en_US.UTF-8
 postgres  | en_US.UTF-8
 template1 | en_US.UTF-8
 zoology   | en_US.UTF-8
(4 rows)

And:

zoology=# select table_schema, 
    table_name, 
    column_name,
    collation_name
from information_schema.columns
where collation_name is not null
order by table_schema,
    table_name,
    ordinal_position;
 table_schema | table_name | column_name | collation_name 
--------------+------------+-------------+----------------
(0 rows)

Best Answer

As you haven't defined a different collation for your column in question, it uses the database-wide one, which is en_US.UTF8 - just like on my test box. I observe the exact same behaviour, take it as a consolation :)

What we see is apparently a case of the variable collation elements. Depending on the character and the collation, a number of different behaviours is possible. Here the underscore (and the hyphen and some others, too) are used only for breaking ties - 'a' and '_a' are equivalent in the first round, then the tie between them is resolved by taking the underscore into account.

If you want to sort with ignoring the underscores (and hyphens, question marks and exclamation marks in my example), you can define an ordering on an expression:

SELECT * 
FROM (VALUES ('a'), 
             ('b1'), 
             ('_a'), 
             ('-a'), 
             ('?a'), 
             ('!a1'), 
             ('a2')
     ) t (val) 
ORDER BY translate(val, '_-?!', '');

In my experiments adding a new value to the list often changes the order between otherwise equal items, showing they are treated really equal.