PostgreSQL – How to Get Multiple Last Values for Different Criteria from a Single Column

Tags: greatest-n-per-group, postgresql, postgresql-performance

Our interest_log table stores events of users expressing interest in a foo.

CREATE TABLE interest_log (
    id bigint NOT NULL PRIMARY KEY,

    user_id bigint REFERENCES foo_user(id),
    foo_id bigint NOT NULL REFERENCES foo(id),

    action varchar(255) NOT NULL,
    comment varchar(300) -- nullable
);

Some of the available actions are:

VIEW, FAVOR, UNFAVOR, DISMISS, and UNDISMISS.

I want to create a materialized view that provides the latest state of whether a user favors or dismisses a foo.

CREATE MATERIALIZED VIEW interest
AS
SELECT
    il.user_id,
    il.foo_id,
    ??? AS is_dismissed,
    ??? AS is_favored
FROM
    interest_log AS il
WHERE
    ???

I came up with this query to select is_dismissed – but it feels too complicated and I don't see how I can combine it with fetching the latest favored state as well.

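SELECT
  il.action = 'DISMISS' AS is_dismissed
FROM interest_log AS il
WHERE
  il.action IN ('DISMISS', 'UNDISMISS')
  AND NOT EXISTS (
    -- no later (un-)dismiss event for the same user and foo
    SELECT 1
    FROM interest_log AS il2
    WHERE
      il2.user_id = il.user_id
      AND il2.foo_id = il.foo_id
      AND il2.action IN ('DISMISS', 'UNDISMISS')
      AND il2.id > il.id
  );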

What could the materialized view query look like, and what fields/indexes should we add to make refreshes fast?

postgres:10.11

Best Answer

Assuming only a few entries per (user_id, foo_id), the query you have is typically faster this way:

SELECT DISTINCT ON (user_id, foo_id)
       user_id, foo_id
     , action = 'DISMISS' AS is_dismissed
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS')
ORDER  BY user_id, foo_id, id DESC;

But this only produces rows for (user_id, foo_id) pairs that have at least one DISMISS or UNDISMISS entry.
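To make that limitation concrete, here is a small self-contained sketch. The CTE named interest_log shadows the real table with four made-up rows (the ids and values are invented for illustration only), so the statement can be run anywhere:

```sql
-- Hypothetical sample data; the CTE shadows the real interest_log table.
WITH interest_log (id, user_id, foo_id, action) AS (
   VALUES (1, 1, 10, 'VIEW')
        , (2, 1, 10, 'DISMISS')
        , (3, 2, 10, 'FAVOR')
        , (4, 1, 10, 'UNDISMISS')
   )
SELECT DISTINCT ON (user_id, foo_id)
       user_id, foo_id
     , action = 'DISMISS' AS is_dismissed   -- taken from the row with the highest id
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS')
ORDER  BY user_id, foo_id, id DESC;
-- Returns a single row: (1, 10, false).
-- User 2 never (un-)dismissed foo 10, so that pair is missing from the result.
```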

If there are relatively few rows per (user_id, foo_id) for both items of interest (is_dismissed and is_favored), just run a FULL [OUTER] JOIN on two of these queries to produce the set of all that either (un-)dismissed or (un-)favored:

SELECT *
FROM  (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'DISMISS' AS is_dismissed
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS')
   ORDER  BY user_id, foo_id, id DESC
   ) d
FULL   JOIN (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'FAVOR' AS is_favored
   FROM   interest_log
   WHERE  action IN ('FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id DESC
   ) f USING (user_id, foo_id);
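Since the question asks for a materialized view, the join above could be wrapped roughly as follows. This is a sketch, not a tested setup: the index name interest_user_foo_uni is made up, mapping "no event yet" to false via COALESCE is an assumption about the desired semantics (drop it to keep NULL), and it assumes user_id is not NULL in practice.

```sql
CREATE MATERIALIZED VIEW interest AS
SELECT user_id, foo_id
     , COALESCE(d.is_dismissed, false) AS is_dismissed  -- assumption: no event means false
     , COALESCE(f.is_favored, false)   AS is_favored
FROM  (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'DISMISS' AS is_dismissed
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS')
   ORDER  BY user_id, foo_id, id DESC
   ) d
FULL   JOIN (
   SELECT DISTINCT ON (user_id, foo_id)
          user_id, foo_id
        , action = 'FAVOR' AS is_favored
   FROM   interest_log
   WHERE  action IN ('FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id DESC
   ) f USING (user_id, foo_id);

-- REFRESH ... CONCURRENTLY needs a unique index on the materialized view.
CREATE UNIQUE INDEX interest_user_foo_uni ON interest (user_id, foo_id);

-- Rebuilds the contents without blocking concurrent reads of the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY interest;
```

A plain REFRESH MATERIALIZED VIEW (without CONCURRENTLY) does not need the unique index but locks out readers while it runs.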

DISTINCT ON is typically a good choice with few rows per group; with many rows per group, other query techniques tend to be faster.

Support with two partial indexes:

CREATE INDEX interest_log_dismiss ON interest_log (user_id, foo_id, id DESC, action)
WHERE  action IN ('DISMISS', 'UNDISMISS');

CREATE INDEX interest_log_favor ON interest_log (user_id, foo_id, id DESC, action)
WHERE  action IN ('FAVOR', 'UNFAVOR');
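One way to check whether the planner picks up a partial index is to look at the plan of the corresponding subquery. A minimal sketch, assuming the index interest_log_dismiss defined above exists:

```sql
-- EXPLAIN ANALYZE executes the query, so run it against realistic data.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (user_id, foo_id)
       user_id, foo_id
     , action = 'DISMISS' AS is_dismissed
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS')   -- matches the index predicate
ORDER  BY user_id, foo_id, id DESC;         -- matches the index sort order
```

Ideally the plan shows an index (or index-only) scan on interest_log_dismiss without a separate sort step. Index-only scans also depend on the table having been vacuumed recently enough, so results may differ right after bulk loads.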

Or using a custom aggregate function last():

CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS 'SELECT $2';

CREATE AGGREGATE public.last(anyelement) (
  SFUNC = public.last_agg
, STYPE = anyelement
);
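A quick sanity check of the aggregate, assuming the two statements above have been executed. The inline VALUES list is invented for illustration. Because the transition function is STRICT, NULL inputs are skipped, so last() yields the last non-NULL value in the given order:

```sql
SELECT last(x ORDER BY ord) AS last_non_null
FROM  (VALUES (1, 'a'), (2, 'b'), (3, NULL)) t(ord, x);
-- Returns 'b': the NULL in the third row is skipped by the strict function.
```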

Then this simpler query does the same:

SELECT user_id, foo_id
     , last(action) FILTER (WHERE action IN ('DISMISS', 'UNDISMISS')) = 'DISMISS' AS is_dismissed
     , last(action) FILTER (WHERE action IN ('FAVOR', 'UNFAVOR'))     = 'FAVOR'   AS is_favored
FROM  (
   SELECT *
   FROM   interest_log
   WHERE  action IN ('DISMISS', 'UNDISMISS', 'FAVOR', 'UNFAVOR')
   ORDER  BY user_id, foo_id, id
   ) sub
GROUP  BY user_id, foo_id;
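The subquery variant relies on the aggregate receiving rows in the order produced by the subquery, which works in practice but is not formally guaranteed. A possible alternative, sketched here under the assumption that the last() aggregate defined above is installed, states the order inside each aggregate call. It may well be slower, since each aggregate sorts its input separately:

```sql
SELECT user_id, foo_id
     , last(action ORDER BY id) FILTER (WHERE action IN ('DISMISS', 'UNDISMISS')) = 'DISMISS' AS is_dismissed
     , last(action ORDER BY id) FILTER (WHERE action IN ('FAVOR', 'UNFAVOR'))     = 'FAVOR'   AS is_favored
FROM   interest_log
WHERE  action IN ('DISMISS', 'UNDISMISS', 'FAVOR', 'UNFAVOR')
GROUP  BY user_id, foo_id;
```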

The last() aggregate gets faster with the additional module first_last_agg, which provides a C implementation.

There are more ways (have a look at the linked answer), and depending on undisclosed information, different query styles may be (much) faster.