PostgreSQL – How to properly implement compound greatest-n filtering

greatest-n-per-group, performance, postgresql, postgresql-performance

Yep, more greatest-n-per-group questions.

Given a table releases with the following columns:

 id         | primary key                 | 
 volume     | double precision            |
 chapter    | double precision            |
 series     | integer-foreign-key         |
 include    | boolean                     | not null

I want to select the compound maximum of volume, then chapter, for a set of series.

Right now, if I query per-distinct-series, I can easily accomplish this as follows:

SELECT 
       releases.chapter AS releases_chapter,
       releases.include AS releases_include,
       releases.series AS releases_series
FROM releases
WHERE releases.series = 741
  AND releases.include = TRUE
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST LIMIT 1;

However, if I have a large set of series (and I do), this quickly runs into efficiency issues where I'm issuing 100+ queries to generate a single page.

I'd like to roll the whole thing into a single query, where I can simply say WHERE releases.series IN (1,2,3....), but I haven't figured out how to convince Postgres to let me do that.

The naive approach would be:

SELECT releases.volume AS releases_volume,
       releases.chapter AS releases_chapter,
       releases.series AS releases_series
FROM 
    releases
WHERE 
    releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499, 
                        556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137, 
                        1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768, 
                        1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255, 
                        2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606, 
                        2634, 2636, 2695, 2696 )
  AND releases.include = TRUE
GROUP BY 
    releases_series
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST;

Which obviously doesn't work:

ERROR:  column "releases.volume" must appear in the 
        GROUP BY clause or be used in an aggregate function

Without the GROUP BY, it does fetch everything, and with some simple procedural filtering it would even work, but there must be a "proper" way to do this in SQL.
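
That procedural fallback would look something like this — a minimal Python sketch with hypothetical rows, where the sort key emulates DESC NULLS LAST by ranking None below every real value:

```python
# Procedural greatest-n-per-group: keep, per series, the row with the
# highest (volume, chapter), compared lexicographically.

def sort_key(row):
    # row is (volume, chapter, series); a (is_not_null, value) pair per
    # column sorts None below all real values, mimicking DESC NULLS LAST
    # once we take the maximum.
    volume, chapter, _ = row
    return (
        (volume is not None, volume if volume is not None else 0.0),
        (chapter is not None, chapter if chapter is not None else 0.0),
    )

def best_per_series(rows):
    best = {}
    for row in rows:
        series = row[2]
        if series not in best or sort_key(row) > sort_key(best[series]):
            best[series] = row
    return best

rows = [
    (1.0, 5.0, 741),   # volume, chapter, series
    (4.0, 1.0, 741),
    (None, 9.0, 12),
    (2.0, None, 12),
]
print(best_per_series(rows))  # {741: (4.0, 1.0, 741), 12: (2.0, None, 12)}
```

It works, but it drags every matching row across the wire just to throw most of them away, which is exactly what a single SQL query should avoid.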

Following the errors, and adding aggregates:

SELECT max(releases.volume) AS releases_volume,
       max(releases.chapter) AS releases_chapter,
       releases.series AS releases_series
FROM 
    releases
WHERE 
    releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499, 
                        556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137, 
                        1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768, 
                        1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255, 
                        2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606, 
                        2634, 2636, 2695, 2696 )
  AND releases.include = TRUE
GROUP BY 
    releases_series;

Mostly works, but the two maximums aren't coherent. If I have two rows, one where volume:chapter is 1:5 and another where it's 4:1, I need to get 4:1 back, but the independent maximums return 4:5.
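
The mismatch is easy to reproduce outside the database; here is a minimal Python sketch of the two rows above (hypothetical data, not the real table):

```python
# Two releases for the same series, as (volume, chapter) pairs:
rows = [(1, 5), (4, 1)]

# Independent per-column maximums, as in MAX(volume), MAX(chapter):
independent = (max(r[0] for r in rows), max(r[1] for r in rows))
print(independent)  # (4, 5) -- a row that doesn't exist

# Lexicographic tuple comparison, as in ORDER BY volume DESC, chapter DESC:
coherent = max(rows)
print(coherent)     # (4, 1) -- the actual latest release
```

The independent aggregates stitch together values from different rows; what's needed is a comparison over the whole (volume, chapter) pair.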

Frankly, this would be so simple to implement in my application code that I have to be missing something obvious here. How can I implement a query that actually satisfies my requirements?

Best Answer

The simple solution in Postgres is with DISTINCT ON:

SELECT DISTINCT ON (r.series)
       r.volume  AS releases_volume
     , r.chapter AS releases_chapter
     , r.series  AS releases_series
FROM   releases r
WHERE  r.series IN (
    12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499
  , 556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137
  , 1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768
  , 1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255
  , 2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606
  , 2634, 2636, 2695, 2696)
AND    r.include
ORDER  BY r.series, r.volume DESC NULLS LAST, r.chapter DESC NULLS LAST;

Depending on data distribution there may be faster techniques, and for long lists there are faster alternatives than IN ().

Combining an unnested array with a LATERAL join:

SELECT r.*
FROM   unnest('{12, 17, 44, 79, 88, 110, 129}'::int[]) t(i)  -- or many more items
     , LATERAL (
   SELECT volume  AS releases_volume
        , chapter AS releases_chapter
        , series  AS releases_series
   FROM   releases
   WHERE  series = t.i 
   AND    include
   ORDER  BY series, volume DESC NULLS LAST, chapter DESC NULLS LAST
   LIMIT  1
   ) r;

This is often faster. For best performance you need a matching multicolumn index like:

CREATE INDEX releases_series_volume_chapter_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST);

And if there are more than a few rows where include is not true, while you are only interested in the rows with include = true, then consider a partial multicolumn index:

CREATE INDEX releases_series_volume_chapter_partial_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
WHERE include;