PostgreSQL – How to Get Multiple Last Values for Different Criteria from a Single Column

greatest-n-per-group, postgresql, postgresql-performance

Our interest_log table stores events of users expressing interest on foo.

CREATE TABLE interest_log (
    id bigint NOT NULL PRIMARY KEY,

    user_id bigint REFERENCES foo_user(id),
    foo_id bigint NOT NULL REFERENCES foo(id),

    action varchar(255) NOT NULL,
    comment varchar(300) -- nullable
);

Some of the available actions are:

VIEW, FAVOR, UNFAVOR, DISMISS, and UNDISMISS.

I want to create a materialized view that provides the latest state of whether a user favors or dismisses a foo.

CREATE MATERIALIZED VIEW interest
AS
SELECT
    il.user_id,
    il.foo_id,
    ??? AS is_dismissed,
    ??? AS is_favored
FROM
    interest_log AS il
WHERE
    ???

I came up with this query to select is_dismissed – but it feels too complicated and I don't see how I can combine it with fetching the latest favored state as well.

SELECT
  il.action = 'DISMISS' AS is_dismissed
FROM interest_log AS il
WHERE
  il.action IN ('DISMISS', 'UNDISMISS')
  AND NOT EXISTS (
    SELECT 1
    FROM interest_log AS il2
    WHERE
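      il2.action IN ('DISMISS', 'UNDISMISS')
      AND il2.user_id = il.user_id
      AND il2.foo_id = il.foo_id
      AND il2.id > il.id  -- no later (un-)dismiss event exists for this pair
  );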

What could the materialized view query look like, and what fields/indexes should we add to make refreshes fast?

postgres:10.11

Best Answer

Assuming only a few entries per (user_id, foo_id), the query you have is typically faster when written this way:

SELECT DISTINCT ON (user_id, foo_id)
       user_id, foo_id
     , action = 'DISMISS' AS is_dismissed
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS')
ORDER  BY user_id, foo_id, id DESC;

But this only produces rows for (user_id, foo_id) pairs that actually have a DISMISS or UNDISMISS event.

If there are relatively few rows per (user_id, foo_id) for both items of interest (is_dismissed and is_favored), just run a FULL [OUTER] JOIN on two of these queries to produce the set of all that either (un-)dismissed or (un-)favored:

SELECT *
FROM  (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'DISMISS' AS is_dismissed
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS')
   ORDER  BY user_id, foo_id, id DESC
   ) d
FULL   JOIN (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'FAVOR' AS is_favored
   FROM   interest_log
   WHERE  action IN ('FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id DESC
   ) f USING (user_id, foo_id);
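
To tie this back to the materialized view from the question, the FULL JOIN query can be wrapped roughly as follows. This is only a sketch under a few assumptions not stated in the question: a pair with no matching event is treated as false via COALESCE, user_id is not NULL for the rows of interest (NULL values never match in the join), and the index name interest_user_foo_uni is made up for illustration.

```sql
CREATE MATERIALIZED VIEW interest AS
SELECT user_id, foo_id
     , COALESCE(d.is_dismissed, false) AS is_dismissed  -- assumption: no event means "not dismissed"
     , COALESCE(f.is_favored,   false) AS is_favored    -- assumption: no event means "not favored"
FROM  (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'DISMISS' AS is_dismissed
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS')
   ORDER  BY user_id, foo_id, id DESC
   ) d
FULL   JOIN (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'FAVOR' AS is_favored
   FROM   interest_log
   WHERE  action IN ('FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id DESC
   ) f USING (user_id, foo_id);

-- A unique index on plain columns is only required for the CONCURRENTLY variant.
CREATE UNIQUE INDEX interest_user_foo_uni ON interest (user_id, foo_id);

-- Rebuilds the view without blocking concurrent reads of it.
REFRESH MATERIALIZED VIEW CONCURRENTLY interest;
```

A plain REFRESH MATERIALIZED VIEW interest works without the unique index, but it locks out readers while it runs.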

About DISTINCT ON and few vs. many:

Select first row in each GROUP BY group?

Support with two partial indexes:

CREATE INDEX interest_log_dismiss ON interest_log (user_id, foo_id, id DESC, action)
WHERE  action IN ('DISMISS', 'UNDISMISS');

CREATE INDEX interest_log_favor ON interest_log (user_id, foo_id, id DESC, action)
WHERE  action IN ('FAVOR', 'UNFAVOR');

Or using a custom aggregate function last():

CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS 'SELECT $2';

CREATE AGGREGATE public.last(anyelement) (
  SFUNC = public.last_agg
, STYPE = anyelement
);

Then this simpler query does the same:

SELECT user_id, foo_id
     , last(action) FILTER (WHERE action IN ('DISMISS', 'UNDISMISS')) = 'DISMISS' AS is_dismissed
     , last(action) FILTER (WHERE action IN ('FAVOR', 'UNFAVOR'))     = 'FAVOR'   AS is_favored
FROM  (
   SELECT *
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS', 'FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id
   ) sub
GROUP  BY user_id, foo_id;
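
Feeding an aggregate from a sorted subquery usually works in Postgres, but the input order is not strictly guaranteed once the outer query does more processing. A possible variant, assuming the last() aggregate defined above, moves the ordering into the aggregate call itself. It is typically slower, since each aggregate sorts its own input:

```sql
SELECT user_id, foo_id
     , last(action ORDER BY id) FILTER (WHERE action IN ('DISMISS', 'UNDISMISS')) = 'DISMISS' AS is_dismissed
     , last(action ORDER BY id) FILTER (WHERE action IN ('FAVOR', 'UNFAVOR'))     = 'FAVOR'   AS is_favored
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS', 'FAVOR', 'UNFAVOR')  -- skip VIEW and other irrelevant actions
GROUP  BY user_id, foo_id;
```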

Faster with the additional module first_last_agg providing a C implementation. See:

Calculating follower growth over time for each influencer

There are more ways (have a look at the linked answer), and depending on undisclosed information, different query styles may be (much) faster.

Related Solutions

PostgreSQL – Efficient Query to Get Last Row Grouped by Multiple Columns

The simplest way in Postgres is with DISTINCT ON:

SELECT DISTINCT ON (1,2,3,4)
       sp.game_id, sportsbook_id, spread_type, spread_duration, game_update_count
FROM   spreads sp
LEFT   JOIN schedule sch USING (game_id)
WHERE  date >= '2012-01-01'
AND    date <= '2012-01-02'
ORDER  BY 4,3,2,1, game_update_count DESC;

Details:

Select first row in each GROUP BY group?

The numbers are just syntax shorthand referring to the ordinal position of SELECT items.

If game_update_count can be NULL, you'll want game_update_count DESC NULLS LAST.

PostgreSQL sort by datetime asc, null first?
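
As a sketch, the same query with NULL values sorted last would look like this. Only the ORDER BY clause changes; table and column names are taken from the query above:

```sql
SELECT DISTINCT ON (1,2,3,4)
       sp.game_id, sportsbook_id, spread_type, spread_duration, game_update_count
FROM   spreads sp
LEFT   JOIN schedule sch USING (game_id)
WHERE  date >= '2012-01-01'
AND    date <= '2012-01-02'
-- DESC puts NULL first by default; NULLS LAST keeps a real count as the picked row
ORDER  BY 4,3,2,1, game_update_count DESC NULLS LAST;
```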

PostgreSQL – Get Last 5 Distinct Values for Each ID

There are probably many ways to do this. The first that comes to mind is to use window functions:

SELECT 
    id, postcode
FROM
  ( SELECT id, postcode, 
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY MAX(date_created) DESC
                             ) AS rn
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rn <= 5
ORDER BY 
    id, rn ;

Test at SQLfiddle.

If there are ties, say the 5th, 6th and 7th postcode for an id have the same date_created, only one of them (choice will be arbitrary) will be in the results. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().
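
A sketch of the RANK() variant, which keeps every postcode tied at the cutoff and can therefore return more than 5 rows per id (the alias rnk is just an illustrative name):

```sql
SELECT
    id, postcode
FROM
  ( SELECT id, postcode,
           RANK() OVER (PARTITION BY id
                        ORDER BY MAX(date_created) DESC
                       ) AS rnk   -- tied postcodes share the same rank
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rnk <= 5
ORDER BY
    id, rnk ;
```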

Another option is to use the LATERAL syntax. I'm not sure which will be more efficient; it will probably depend on the value distribution of the two columns (id and postcode), i.e. how many distinct ids there are in the whole table, how many distinct postcodes per id, and how many rows per (id, postcode) combination.

SELECT 
    t.id, ti.postcode
FROM
    ( SELECT DISTINCT id
      FROM tablename
    ) AS t
  CROSS JOIN LATERAL
    ( SELECT tt.postcode,
             MAX(tt.date_created) AS date_created
      FROM tablename AS tt
      WHERE tt.id = t.id
      GROUP BY tt.postcode
      ORDER BY date_created DESC
      LIMIT 5
    ) AS ti 
ORDER BY 
    t.id, ti.date_created DESC;

Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).
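
For illustration, the second index suggestion could be created like this. The index name is hypothetical, and whether the planner actually uses the index depends on the data distribution:

```sql
-- Hypothetical index name; columns as suggested above.
CREATE INDEX tablename_id_postcode_date_created_idx
    ON tablename (id, postcode, date_created DESC);
```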
