Mariadb – Variation on greatest-n-per-group where whole table is needed with indicator, not just greatest rows per group

greatest-n-per-groupmariadb

I have a query that returns some rows having a grouping on a column and dates. I would like to have that query indicate in a column which of each group has the greatest date.

There are a vast number of questions on how to do the "greatest n per group", but those questions only want the row from each group with the greatest value. I want all the rows, and just an indication of which is the greatest.

For example, here is a table:

id	rep	date
1	27058	2020-11-16
2	27058	2020-11-09
3	27058	2020-10-30
4	46029	2020-11-03

I would like to have something like the following:

id	rep	date	greatest
1	27058	2020-11-16	1
2	27058	2020-11-09	0
3	27058	2020-10-30	0
4	46029	2020-11-03	1

I have experimented with some greatest-n-per-group methods to add a "greatest" indicator. But since those attempts are focused on the only the row with the greatest value, when the part of the where clause like "table2.date IS NULL" is removed, extra rows are added with all the join combinations.

Here is an example of such an experiment that does not work. It tries to use the greatest-n-per-group but comments out the IS NULL line:

SELECT DISTINCT m1.usermessageid, m1.repid, m1.sender_role, m1.message, m1.datesent, m2.datesent
FROM usermessage m1
LEFT JOIN usermessage m2 ON m1.principalid = m2.principalid and m1.repid = m2.repid AND m1.datesent < m2.datesent
WHERE m1.principalid = 99831
AND m1.repid IN (21600,99214,27058,97360,92678,96184,55429,46029)
# AND m2.datesent IS NULL
ORDER BY m1.datesent desc;

I would like to avoid having to add an additional query just to get the greatest rows.

Best Answer

Use a window function:

SELECT ROW_NUMBER() OVER (PARTITION BY repid ORDER BY date DESC) = 1 AS greatest, ...

This requires MySQL 8.0.02 or MariaDB 10.2.0 to use window functions.

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why DESC NULLS LAST?

Unused index in range of dates query

It's safe to assume you have an equipment table? Then performance won't be a problem:

Correlated subquery

Based on this equipment table, run a lowly correlated subquery to great effect:

SELECT equipment_id
     ,(SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

For a small number of rows in the equipment table (57 judging from your EXPLAIN ANALYZE output), that's very fast.

`LATERAL` join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

Performance similar to the correlated subquery.

Function

If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;

Makes for a nice call, too:

SELECT * FROM f_latest_equip();

Performance comparison:

db<>fiddle here
_{OLD sqlfiddle}

Postgresql – How to properly implement compound greatest-n filtering

The simple solution in Postgres is with DISTINCT ON:

SELECT DISTINCT ON (r.series)
       r.volume  AS releases_volume
     , r.chapter AS releases_chapter
     , r.series  AS releases_series
FROM   releases r
WHERE  r.series IN (
    12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499
  , 556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137
  , 1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768
  , 1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255
  , 2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606
  , 2634, 2636, 2695, 2696)
AND    r.include
ORDER  BY r.series, r.volume DESC NULLS LAST, r.chapter DESC NULLS LAST;

Details:

Select first row in each GROUP BY group?

Depending on data distribution there may be faster techniques:

Optimize GROUP BY query to retrieve latest record per user

Also, there are faster alternatives for long lists than IN ().

Combining an unnested array with a LATERAL join:

SELECT r.*
FROM   unnest('{12, 17, 44, 79, 88, 110, 129}'::int[]) t(i)  -- or many more items
     , LATERAL (
   SELECT volume  AS releases_volume
        , chapter AS releases_chapter
        , series  AS releases_series
   FROM   releases
   WHERE  series = t.i 
   AND    include
   ORDER  BY series, volume DESC NULLS LAST, chapter DESC NULLS LAST
   LIMIT  1
   ) r;

Is often faster. For best performance you need a matching multicolumn index like:

CREATE INDEX releases_series_volume_chapter_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST);

Extremely slow query on indexed column

And if there are more than a few rows where include is not true, while you are only interested in the rows with include = true, then consider a partial multicolumn index:

CREATE INDEX releases_series_volume_chapter_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
WHERE include;

Best Answer

Related Solutions

Postgresql – Efficient query to get greatest value per group from big table

Index

Correlated subquery

LATERAL join in Postgres 9.3+

Function

Postgresql – How to properly implement compound greatest-n filtering

Related Question

`LATERAL` join in Postgres 9.3+