Postgresql – How to order the results from a DISTINCT ON query

performancepostgresqlquery-performance

I have a table with the equivalent of:

group_id integer REFERENCES <some other table>
item_id integer REFERECNES <some other table>
timestamp date
data text
sort_order text 

I want to get the latest data for every item before some date. I.e. the maximum timestamp less than some value. I also want these rows sorted on sort_order.

After som Googling I find that SELECT DISTINCT ON is what I want, so I get something like

  SELECT DISTINCT ON (item_id)
         data
    FROM [this table]
   WHERE group_id = [some value]
     AND timestamp < [some value]
ORDER BY item_id, timestamp DESC

Now, I actually want the data sorted on sort_order, but DISTINCT ON requires the ORDER BY to be listed first.

One solution that comes to mind is to wrap all of this in an outer SELECT that will ORDER BY sort_order. Is that a good efficient solution, or is there a better "correct" way to do this.

(Or, am I in fact not even close…)

Best Answer

Wrapping the query in a derived table is the obvious way to solve this:

SELECT *
FROM
    ( SELECT DISTINCT ON (item_id)
             item_id, data, timestamp, sort_order, ... 
        FROM [this table]
       WHERE group_id = [some value]
         AND timestamp < [some value]
    ORDER BY item_id, timestamp DESC
   ) AS t
ORDER BY sort_order ;

If your original query is efficient and doesn't return too many rows, the additional sort will not add much cost to the query.


Besides DISTINCT ON, there are a few other methods that are often more efficient in this type of "greatest-n-pre-group" queries - but efficiency depends on many parameters, like targeted indexes and distribution of values (how many distinct item_id in the table, how many rows per item_id), Postgres version, etc.

Most notable method is using LATERAL subqueries. See these excellent answers by Erwin Brandstetter: