Is Row Ordering Preserved in PostgreSQL Functions and CTEs?

order-bypostgresql

In SELECT statements the ordering of the returned rows is not guaranteed if the ORDER BY clause is not specified. This is true for "normal" tables.

Is this also true for ordered tables generated using WITH expressions (CTEs)? Ordered tables returned by functions? I assume that not. Is this somewhere explicitly stated in the documentation?

Specifically, can I assume that this (more efficient) query:

WITH ordered AS ( SELECT * FROM table1 ORDER BY col1 )
SELECT sum(col2) result FROM
generate_series(0,50) nr,
LATERAL (SELECT * FROM ordered LIMIT 100 OFFSET nr*100) a
GROUP BY nr ORDER BY nr;

will be equivalent to this query:

SELECT sum(col2) result FROM
generate_series(0,50) nr,
LATERAL (SELECT * FROM table1 ORDER BY col1 LIMIT 100 OFFSET nr*100) a
GROUP BY nr ORDER BY nr;

For functions:

If I have such function:

CREATE FUNCTION do_sort(name text[]) RETURNS TABLE(name text) AS $$
SELECT name.name FROM unnest($1) name ORDER BY name.name ASC;
$$ LANGUAGE SQL IMMUTABLE;

can I make such wrapper function and assume that the array will always be correctly sorted?

CREATE FUNCTION do_sort_returns_array(name text[]) RETURNS text[] AS $$
SELECT array_agg(name) FROM do_sort($1);
$$ LANGUAGE SQL IMMUTABLE;

Best Answer

In SELECT statements the ordering of the returned rows is not guaranteed if the ORDER BY clause is not specified. This is true for all tables, simple or complicated queries.

Now, having that as basis, we should consider that Postgres has implemented CTEs in a peculiar way. They are always materialized (see: PostgreSQL’s CTEs are optimisation fences).

That doesn't mean that you should rely on this. When using a CTE, like in the first query, an ORDER BY should be added as well when referencing the CTE, as the ORDER BY inside the CTE may be removed (I'm not saying that it will always be removed as my tests show that a sort is performed. But it may be removed in a future optimizer change, as an ORDER BY without LIMIT is redundant):

WITH ordered AS ( SELECT * FROM table1 ORDER BY col1 )    -- redundant ORDER BY
SELECT sum(col2) result FROM
generate_series(0,50) nr,
LATERAL (SELECT * FROM ordered ORDER BY col1        -- ORDER BY added
         LIMIT 100 OFFSET nr*100) a
GROUP BY nr ORDER BY nr;

Update, from Craig Ringer, after a comment I made in his above linked blog post:

It’s like quite a few other cases – right now PostgreSQL will always return the rows in the order they’re output by the CTE, but you shouldn’t technically rely on it. Who knows what future features could change that?

Unfortunately IIRC it’s also not smart enough to recognise that the rows are ordered correctly if you add another ORDER BY in the outer query. (Haven’t tested, but pretty sure). So it’s not necessarily free to do it the “right” way.

So, if you want the results ordered in a specific ordering, add an ORDER BY in the final SELECT. You could get away without it now (and you might get some slight performance gain) but there is no guarantee that this will not change in a future release.

Another reason for explicitly using ORDER BY is that you may not be always maintaining the code you write now. Another developer may try to reorganize the query without CTEs, using derived tables or LATERAL joins. They would need to know that your query relies in the order provided by the CTE so it needs to be either used in the final SELECT or commented or documented somewhere.


Now, another way to write the query would be adding a LIMIT in the CTE and using ROW_NUMBER():

WITH ordered AS 
    ( SELECT col2, (ROW_NUMBER() OVER (ORDER BY col1) - 1) / 100 AS nr 
      FROM table1 
      ORDER BY col1 LIMIT 5100 )
SELECT n.nr, sum(o.col2) AS result 
FROM generate_series(0, 50) AS n (nr)
    LEFT JOIN ordered AS o
      ON n.nr = o.nr 
GROUP BY n.nr
ORDER BY n.nr ;

This way, the CTE will only be materialized for the 5100 rows you want and not for all (possibly millions) rows of the table. Even better if there is an index on (col1, col2). You could even remove the generate_series() and/or remove the CTE:

SELECT o.nr, sum(o.col2) AS result 
FROM 
    ( SELECT col2, (ROW_NUMBER() OVER (ORDER BY col1) - 1) / 100 AS nr 
      FROM table1 
      ORDER BY col1 LIMIT 5100 
    ) AS o
GROUP BY o.nr
ORDER BY o.nr ;