Why Postgres ORDER BY Ignores Leading Underscores

natural sortorder-bypostgresqlpostgresql-9.1

I have an animal table with a name varchar(255), and I've added rows with the following values:

Piranha
__Starts With 2
Rhino
Starts With 1
0_Zebra
_Starts With 1
Antelope
_Starts With 1

When I run this query:

zoology=# SELECT name FROM animal ORDER BY name;
      name       
-----------------
0_Zebra
Antelope
Piranha
Rhino
_Starts With 1
_Starts With 1
Starts With 1
__Starts With 2
(8 rows)

Notice how the rows are sorted in an order that implies the leading _ is used to place the _Starts With 1 rows before the Starts row, but the __ in the __Starts With 2 seems to ignore this fact, as if the 2 at the end is more important than the first two characters.

Why is this?

If I sort with Python, the result is:

In  [2]: for animal in sorted(animals):
   ....:     print animal
   ....:     
0_Zebra
Antelope
Piranha
Rhino
Starts With 1
_Starts With 1
_Starts With 1
__Starts With 2

Furthermore, Python ordering suggests that underscores come after letters, which indicates that the Postgres's sorting of the first two _Starts rows before the Starts row is incorrect.

Note: I'm using Postgres 9.1.15

Here are my attempts at finding the collation:

zoology=# select datname, datcollate from pg_database;
  datname  | datcollate  
-----------+-------------
 template0 | en_US.UTF-8
 postgres  | en_US.UTF-8
 template1 | en_US.UTF-8
 zoology   | en_US.UTF-8
(4 rows)

And:

zoology=# select table_schema, 
    table_name, 
    column_name,
    collation_name
from information_schema.columns
where collation_name is not null
order by table_schema,
    table_name,
    ordinal_position;
 table_schema | table_name | column_name | collation_name 
--------------+------------+-------------+----------------
(0 rows)

Best Answer

As you haven't defined a different collation for your column in question, it uses the database-wide one, which is en_US.UTF8 - just like on my test box. I observe the exact same behaviour, take it as a consolation :)

What we see is apparently a case of the variable collation elements. Depending on the character and the collation, a number of different behaviours is possible. Here the underscore (and the hyphen and some others, too) are used only for breaking ties - 'a' and '_a' are equivalent in the first round, then the tie between them is resolved by taking the underscore into account.

If you want to sort with ignoring the underscores (and hyphens, question marks and exclamation marks in my example), you can define an ordering on an expression:

SELECT * 
FROM (VALUES ('a'), 
             ('b1'), 
             ('_a'), 
             ('-a'), 
             ('?a'), 
             ('!a1'), 
             ('a2')
     ) t (val) 
ORDER BY translate(val, '_-?!', '');

In my experiments adding a new value to the list often changes the order between otherwise equal items, showing they are treated really equal.

Related Solutions

SQL Server – Why ORDER BY Does Not Belong in a View

(Indexed views aside, of course.)

A view is not materialized - the data isn't stored, so how could it be sorted? A view is kind of like a stored procedure that just contains a SELECT with no parameters... it doesn't hold data, it just holds the definition of the query. Since different references to the view could need data sorted in different ways, the way that you do this - just like selecting from a table, which is also an unsorted collection of rows, by definition - is to include the order by on the outer query.

Also to give a little insight into the history. You could never put ORDER BY in a view, without also including TOP. And in this case the ORDER BY dictated which rows were included by TOP, not how they would be presented. It just so happened that in SQL Server 2000, if TOP was 100 PERCENT or {some number >= number of rows in the table}, the optimizer was fairly simplistic and it ended up producing a plan with a sort that matched the TOP/ORDER BY. But this behavior was never guaranteed or documented - it was just relied upon based on observation, which is a bad habit. When SQL Server 2005 came out, this behavior started "breaking" because of changes in the optimizer that led to different plans and operators being used - among other things, the TOP / ORDER BY would be ignored completely if it was TOP 100 PERCENT. Some customers complained about this so loudly that Microsoft issued a trace flag to reinstate the old behavior. I'm not going to tell you what the flag is because I don't want you to use it and I want to make sure that the intent is correct - if you want a predictable sort order, use ORDER BY on the outer query.

To summarize and just as much to clarify a point you made: Microsoft didn't remove anything. They made the product better, and as a side effect this undocumented, non-guaranteed behavior became less reliable. Overall, I think the product is better for it.

Postgresql – Postgres: “Pivot” unioned table based on one column

If it is faster you have to measure for yourself. However, doing it on the db side sends less data across, so I would assume it to be faster.

The pivot itself is fairly simple. I put your query's result in a table to make the example simpler.

SQL Fiddle

PostgreSQL 9.1.9 Schema Setup:

CREATE TABLE your_query
    ("rel_id" int, "timestamp" timestamp, "y" varchar(1))
;

INSERT INTO your_query
    ("rel_id", "timestamp", "y")
VALUES
    (1, '2013-01-01 00:00:00', 'a'),
    (1, '2013-01-02 00:00:00', 'b'),
    (1, '2013-01-03 00:00:00', 'c'),
    (1, '2013-01-04 00:00:00', 'd'),
    (2, '2013-01-01 00:00:00', 'e'),
    (2, '2013-01-04 00:00:00', 'f'),
    (2, '2013-01-06 00:00:00', 'g')
;

First step is to only return one row per date. That is simply done with a group by:

Query 1:

SELECT timestamp
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |
|--------------------------------|
| January, 01 2013 00:00:00+0000 |
| January, 02 2013 00:00:00+0000 |
| January, 03 2013 00:00:00+0000 |
| January, 04 2013 00:00:00+0000 |
| January, 06 2013 00:00:00+0000 |

Now wee need to pull the "correct" value into each column. For that we combine an aggregate with a case. The case returns null for all rows for which the condition is not met. the aggregate ignores nulls. That leaves the one value we are looking for:

Query 2:

SELECT timestamp,
       MAX(CASE WHEN rel_id = 1 THEN y END ) AS "1",
       MAX(CASE WHEN rel_id = 2 THEN y END ) AS "2"
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |      1 |      2 |
|--------------------------------|--------|--------|
| January, 01 2013 00:00:00+0000 |      a |      e |
| January, 02 2013 00:00:00+0000 |      b | (null) |
| January, 03 2013 00:00:00+0000 |      c | (null) |
| January, 04 2013 00:00:00+0000 |      d |      f |
| January, 06 2013 00:00:00+0000 | (null) |      g |

To make this work with your original query just replace your_query in my example with

(
(SELECT rel_id, timestmap, y FROM table_1 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

(SELECT rel_id, timestamp, y FROM table_2 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

...
) AS your_query

Best Answer

Related Solutions

SQL Server – Why ORDER BY Does Not Belong in a View

Postgresql – Postgres: “Pivot” unioned table based on one column

Related Question