Postgresql – Get top two rows per group efficiently

database-designgreatest-n-per-grouppostgresqlpostgresql-performance

Currently, I use a query with DISTINCT ON to get a list of recent posts for each author.

But how can I get previous recent posts for each author? In other words, how to make DISTINCT ON return second row of each group, not first.

I need this to compare recent post with previous one.

CREATE TABLE posts (
  title varchar(30),
  author varchar(30),
  created_at date
);

INSERT INTO posts VALUES
('Johns first post', 'John', 'January 1, 2021'),
('Johns second post', 'John', 'January 2, 2021'),
('Johns third post', 'John', 'January 3, 2021'),
('Mikes first post', 'Mike', 'January 1, 2021'),
('Mikes second post', 'Mike', 'January 2, 2021'),
('Mikes third post', 'Mike', 'January 3, 2021');

This query selects the most recent post for each author. (Third post for each in my example.):

SELECT DISTINCT ON (author) * FROM posts ORDER BY author ASC, created_at DESC

db<>fiddle here

I also need the previous post for each author (Second post for each in my example.)

I don't want to use window functions because my table is large enough, and I suppose window functions could be slow.

Best Answer

DISTINCT ON is only good to get a single (distinct) row per group. And only the one you can sort on top somehow. Even then it's only efficient with few rows per group. See:

I am going to assume a big table with many posts per author (the typical case).

Simple and slow

Can be built around row_number() in a subquery:

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY author ORDER BY created_at DESC NULLS LAST) AS post_num
   FROM   posts
   ) p
WHERE  post_num < 3;

DESC NULLS LAST since all your columns can be NULL.
You'd really want all columns to be NOT NULL and created_at to be timestamptz. (See below for more.)

The query will use a sequential scan, which is very inefficient for the case. (Just like you supposed.)

Sophisticated and fast

You'd want to use an index efficiently. Assuming a table design as primitive as displayed, one like:

CREATE INDEX ON posts (author DESC NULLS LAST, created_at DESC NULLS LAST);

We can put this to work, with some sophistication:

WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   posts
   ORDER  BY author DESC NULLS LAST, created_at DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT p.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT *
      FROM   posts p
      WHERE  p.author < c.author  -- lateral reference
      ORDER  BY author DESC NULLS LAST, created_at DESC NULLS LAST
      LIMIT  1
      ) p
)
SELECT *, 1 AS post_num
FROM   cte

UNION ALL
SELECT p.*
FROM   cte c
CROSS  JOIN LATERAL (
   SELECT *, 2 AS post_num
   FROM   posts p
   WHERE  p.author = c.author
   AND    p.created_at < c.created_at   -- assuming no two posts with same date
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) p;

db<>fiddle here

The first step is a classic recursive CTE to get the first post per author. Detailed explanation here:

The second step it to fetch the next post for each author in a LATERAL subquery - making use of the index once more.

Simple and fast

In a proper relational design, you'd have a separate author table like:

CREATE TABLE author (
  author_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author    text NOT NULL
);

INSERT INTO author(author) VALUES
  ('John')
, ('Mike');

CREATE TABLE post (
  post_id    int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author_id  int NOT NULL REFERENCES author
, title      varchar(30) NOT NULL
, created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO post (author_id, title, created_at) VALUES
  (1, 'Johns first post',  'January 1, 2021')
, (1, 'Johns second post', 'January 2, 2021')
, (1, 'Johns third post',  'January 3, 2021')
, (2, 'Mikes first post',  'January 1, 2021')
, (2, 'Mikes second post', 'January 2, 2021')
, (2, 'Mikes third post',  'January 3, 2021')
;

Then the index can simply be:

CREATE INDEX ON post (author_id, created_at);

And we can have a very simple and very efficient query:

SELECT p.*
FROM   author a
CROSS  JOIN LATERAL (
   SELECT *
   FROM   post
   WHERE  author_id = a.author_id
   ORDER  BY created_at DESC
   LIMIT  2
   ) p;

db<>fiddle here