Postgresql – Get top two rows per group efficiently

database-designgreatest-n-per-grouppostgresqlpostgresql-performance

Currently, I use a query with DISTINCT ON to get a list of recent posts for each author.

But how can I get previous recent posts for each author? In other words, how to make DISTINCT ON return second row of each group, not first.

I need this to compare recent post with previous one.

CREATE TABLE posts (
  title varchar(30),
  author varchar(30),
  created_at date
);

INSERT INTO posts VALUES
('Johns first post', 'John', 'January 1, 2021'),
('Johns second post', 'John', 'January 2, 2021'),
('Johns third post', 'John', 'January 3, 2021'),
('Mikes first post', 'Mike', 'January 1, 2021'),
('Mikes second post', 'Mike', 'January 2, 2021'),
('Mikes third post', 'Mike', 'January 3, 2021');

This query selects the most recent post for each author. (Third post for each in my example.):

SELECT DISTINCT ON (author) * FROM posts ORDER BY author ASC, created_at DESC

db<>fiddle here

I also need the previous post for each author (Second post for each in my example.)

I don't want to use window functions because my table is large enough, and I suppose window functions could be slow.

Best Answer

DISTINCT ON is only good to get a single (distinct) row per group. And only the one you can sort on top somehow. Even then it's only efficient with few rows per group. See:

Select first row in each GROUP BY group?

I am going to assume a big table with many posts per author (the typical case).

Simple and slow

Can be built around row_number() in a subquery:

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY author ORDER BY created_at DESC NULLS LAST) AS post_num
   FROM   posts
   ) p
WHERE  post_num < 3;

DESC NULLS LAST since all your columns can be NULL.
You'd really want all columns to be NOT NULL and created_at to be timestamptz. (See below for more.)

The query will use a sequential scan, which is very inefficient for the case. (Just like you supposed.)

Sophisticated and fast

You'd want to use an index efficiently. Assuming a table design as primitive as displayed, one like:

CREATE INDEX ON posts (author DESC NULLS LAST, created_at DESC NULLS LAST);

We can put this to work, with some sophistication:

WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   posts
   ORDER  BY author DESC NULLS LAST, created_at DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT p.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT *
      FROM   posts p
      WHERE  p.author < c.author  -- lateral reference
      ORDER  BY author DESC NULLS LAST, created_at DESC NULLS LAST
      LIMIT  1
      ) p
)
SELECT *, 1 AS post_num
FROM   cte

UNION ALL
SELECT p.*
FROM   cte c
CROSS  JOIN LATERAL (
   SELECT *, 2 AS post_num
   FROM   posts p
   WHERE  p.author = c.author
   AND    p.created_at < c.created_at   -- assuming no two posts with same date
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) p;

db<>fiddle here

The first step is a classic recursive CTE to get the first post per author. Detailed explanation here:

Optimize GROUP BY query to retrieve latest row per user

The second step it to fetch the next post for each author in a LATERAL subquery - making use of the index once more.

Simple and fast

In a proper relational design, you'd have a separate author table like:

CREATE TABLE author (
  author_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author    text NOT NULL
);

INSERT INTO author(author) VALUES
  ('John')
, ('Mike');

CREATE TABLE post (
  post_id    int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author_id  int NOT NULL REFERENCES author
, title      varchar(30) NOT NULL
, created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO post (author_id, title, created_at) VALUES
  (1, 'Johns first post',  'January 1, 2021')
, (1, 'Johns second post', 'January 2, 2021')
, (1, 'Johns third post',  'January 3, 2021')
, (2, 'Mikes first post',  'January 1, 2021')
, (2, 'Mikes second post', 'January 2, 2021')
, (2, 'Mikes third post',  'January 3, 2021')
;

Then the index can simply be:

CREATE INDEX ON post (author_id, created_at);

And we can have a very simple and very efficient query:

SELECT p.*
FROM   author a
CROSS  JOIN LATERAL (
   SELECT *
   FROM   post
   WHERE  author_id = a.author_id
   ORDER  BY created_at DESC
   LIMIT  2
   ) p;

db<>fiddle here

Related Solutions

Need help on database design

I suggest you build seperate tables for all repeatable data. This is good database design and the DBMS you are using should be able to handle it. It provides you with these benefits:

You arrange data into logical groupings such that each group describes a small part of the whole.
You minimize the amount of duplicate data stored in your database.
You organize the data so when you modify it, you make the change in only one place.
You build a database in which you can access and manipulate the data quickly and efficiently.
Your database is better maintainable in case your data model changes.

Postgresql – Get rows with most recent date for each different item

Skinning Q#1 independently and slightly differently than @ypercube

with cte as (select row_number() over (partition by product_id,
                                       invoice_id 
                                 order by dt desc) as rn,
                    product_id,
                    invoice_id,
                    amount,dt
               from product ) 
select product_id, invoice_id,amount,dt
  from cte
 where rn=1
 order by product_id,invoice_id;

 product_id | invoice_id | amount |     dt     
------------+------------+--------+------------
 PROD1      | INV01      |      2 | 2018-01-05
 PROD1      | INV03      |      1 | 2018-01-05
 PROD2      | INV02      |      3 | 2018-01-08
 PROD2      | INV04      |      4 | 2018-01-08
(4 rows)

For Q#2, you are on the right track, but the SQL will have a cross join(gasp!)

I think a function with a loop/cursor would be more optimized (i'll try that in my next free block of time)

--the cte will give us the real values
with cte as (select product_id, 
                    sum(amount) as amount, 
                    dt
               from product
              group by product_id,dt)
select p.product_id,  
       (select cte.amount --choose the amount
          from cte
         where cte.product_id = p.product_id
           and cte.dt <= d.gdt -- for same day or earlier
         order by cte.dt desc
         limit 1) as finamt,
       d.gdt
from (select generate_series( (select min(dt)
                                 from product), --where clause if some products 
                                                --don't have an amount
                              (select max(dt)
                                 from product),
                              '1 day' 
                            )::date as gdt)  d
cross join --assuming each listed product has an amount on the min date
     (select distinct product_id
        from product) p
left join --since we need to fill the gaps
     cte on ( d.gdt = cte.dt 
             and p.product_id = cte.product_id)
order by d.gdt, p.product_id
;

Best Answer

Simple and slow

Sophisticated and fast

Simple and fast

Related Solutions

Need help on database design

Postgresql – Get rows with most recent date for each different item

Related Question