PostgreSQL – Selecting N Rows Before or After a Specified Row

postgresql

Let's say I have a simple table in Postgres (11.3):

create table posts
(
    id serial not null,
    created_at timestamp(0),
    constraint posts_pkey
        primary key (id)
);

If a user requests id=5869, I need to be able to return the N rows before and the N rows after that row in a query that is ordered by the created_at column. If we're able to assume that the larger the id, the larger the created_at, we can do something relatively simple like this:

(select * from posts where id < 5869 order by id desc limit 10)
union all
(select * from posts where id >= 5869 order by id limit 11);

However, I am unable to assume that the higher id is most recently created and I'm wondering what the best way to retrieve that data is in that case. This method works, but is remarkably slow on a 100k row dataset:

WITH 
  boundaries AS (
    SELECT *,
           row_number() OVER (ORDER BY created_at DESC) AS rownum
    FROM posts
  ),
  target_boundary AS (
    SELECT *
    FROM boundaries
    WHERE boundaries.id = 5869
  )
SELECT posts.*, boundaries.rownum
FROM posts
LEFT JOIN boundaries ON posts.id = boundaries.id
JOIN target_boundary
  ON boundaries.rownum BETWEEN target_boundary.rownum - 10 AND target_boundary.rownum + 10;

That query was taking upwards of 800 milliseconds, which is far too slow for a dataset this small.

I have also tried a variation of the above using lead() and lag(), but that was even less efficient.

Is there a better way to do this query? Is there perhaps a window function I am missing in Postgres that would handle it?

Best Answer

Use the power of UNION ALL:

WITH init AS (
   SELECT created_at
   FROM posts 
   WHERE id = 5869
)
(
   (SELECT posts.*
    FROM posts
       CROSS JOIN init
    WHERE posts.created_at <= init.created_at 
    ORDER BY posts.created_at DESC 
    LIMIT 11) 
 UNION ALL 
   (SELECT posts.* 
    FROM posts 
       CROSS JOIN init 
    WHERE posts.created_at > init.created_at 
    ORDER BY posts.created_at 
    LIMIT 10)
);

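This query assumes that there are no duplicates in created_at. If several rows share the target row's timestamp (plausible with timestamp(0), which stores whole seconds only), they all fall into the first branch in arbitrary order, so the LIMIT can return an arbitrary subset of them or even cut off the target row itself.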
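If duplicates in created_at cannot be ruled out, one possible variation is to break ties on id by comparing (created_at, id) pairs with a row comparison. This is a sketch under that assumption, not part of the original answer; note that rows with a NULL created_at never match either branch.

```sql
WITH init AS (
   -- Fetch both sort keys of the target row.
   SELECT created_at, id
   FROM posts
   WHERE id = 5869
)
(
   -- The target row plus the 10 rows just before it in (created_at, id) order.
   (SELECT posts.*
    FROM posts
       CROSS JOIN init
    WHERE (posts.created_at, posts.id) <= (init.created_at, init.id)
    ORDER BY posts.created_at DESC, posts.id DESC
    LIMIT 11)
 UNION ALL
   -- The 10 rows just after it in (created_at, id) order.
   (SELECT posts.*
    FROM posts
       CROSS JOIN init
    WHERE (posts.created_at, posts.id) > (init.created_at, init.id)
    ORDER BY posts.created_at, posts.id
    LIMIT 10)
);
```

Because id is unique, the pair (created_at, id) gives every row a distinct position, so each row falls into exactly one branch.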

For good performance, you need indexes on id (you have that with the primary key) and created_at.
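The index on created_at could be created as follows. The index names are hypothetical; the composite variant is only an assumption about what would help if you also sort or filter on id as a tie-breaker.

```sql
-- Lets both branches read rows in created_at order and stop at the LIMIT.
CREATE INDEX posts_created_at_idx ON posts (created_at);

-- Alternative if ties on created_at are broken by id:
-- CREATE INDEX posts_created_at_id_idx ON posts (created_at, id);
```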

If you need the result sorted, use my query as a subselect and add an ORDER BY.
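A sketch of that wrapping, with the UNION ALL as a subselect in FROM. The alias neighbours is made up for illustration.

```sql
WITH init AS (
   SELECT created_at
   FROM posts
   WHERE id = 5869
)
SELECT *
FROM (
   (SELECT posts.*
    FROM posts
       CROSS JOIN init
    WHERE posts.created_at <= init.created_at
    ORDER BY posts.created_at DESC
    LIMIT 11)
 UNION ALL
   (SELECT posts.*
    FROM posts
       CROSS JOIN init
    WHERE posts.created_at > init.created_at
    ORDER BY posts.created_at
    LIMIT 10)
) AS neighbours
-- Sort the (at most 21) combined rows; use DESC for newest first.
ORDER BY created_at;
```

The inner ORDER BY clauses are still required, since they decide which rows each LIMIT keeps. The outer ORDER BY only arranges the final result.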

Related Solutions

Postgresql – Performing SELECT on EACH ROW in CTE or Nested QUERY

Let me know if this helps. This uses MSSQL (T-SQL) syntax, so you might have to adjust it for Postgres. Assume a temp table called #tree with the two columns id and parent, populated as in your example:

SELECT 1, 0
UNION
SELECT 2, 1
UNION
SELECT 3, 1
UNION
SELECT 4, 2
UNION
SELECT 5, 2
UNION
SELECT 6, 4
UNION
SELECT 7, 6
UNION
SELECT 8, 6

Then the recursive CTE is:

with cte as
(
    select id as currentnode, id as root
    from #tree
    union all
    select t.id as currentnode, cte.root 
    from #tree t join cte on t.parent = cte.currentnode
)
select *
from cte
order by root

The recursive CTE above pairs every node, taken as a root, with each node reachable from it, including the root itself. To count them per root, replace the final SELECT with:

SELECT root, count(*)
from cte
group by root

The count(*) query should yield:

root   |  count
----------------
  1    |    8
  2    |    6
  3    |    1
  4    |    4
  5    |    1
  6    |    3
  7    |    1
  8    |    1

You could then subtract 1 from each of those to get the result you were hinting at. Is this what you were looking for? Even if not, I feel like a recursive CTE can probably help with what you need.
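For Postgres, the adjustment could look roughly like the following. This is a sketch, not a tested port: the table name tree is a hypothetical stand-in for #tree, and the main syntax change is that Postgres requires the RECURSIVE keyword.

```sql
-- Hypothetical stand-in for the T-SQL temp table #tree.
CREATE TEMP TABLE tree (id int PRIMARY KEY, parent int);

INSERT INTO tree (id, parent) VALUES
    (1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 4), (7, 6), (8, 6);

WITH RECURSIVE cte AS (
    -- Anchor: every node starts as the root of its own subtree.
    SELECT id AS currentnode, id AS root
    FROM tree
    UNION ALL
    -- Step: move from the current node to its children, keeping the root.
    SELECT t.id, cte.root
    FROM tree t
    JOIN cte ON t.parent = cte.currentnode
)
-- Subtract 1 so that a root is not counted as its own descendant.
SELECT root, count(*) - 1 AS descendants
FROM cte
GROUP BY root
ORDER BY root;
```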

PostgreSQL – Troubleshooting Very Slow SELECT on Single Table

"Running an EXPLAIN ANALYSE VERBOSE massively underestimates the time taken to return the results"

There's a misunderstanding here: EXPLAIN ANALYZE does not estimate. It runs the query for real and reports the actual time taken by each step, whereas EXPLAIN without ANALYZE just reports the estimates without running the query.

Consider this line from your EXPLAIN ANALYZE output:

(actual time=0.135..594.929 rows=426934 loops=1)

These 595 milliseconds are not an estimate; they are the actual time. Besides, when Postgres shows estimates, they're expressed in units of "cost", not in units of time.
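The two forms can be compared on the same statement. In this sketch, big_table is a hypothetical stand-in for the table from the question.

```sql
-- Plan only: the query is not executed, and the figures shown are
-- planner estimates expressed in cost units.
EXPLAIN
SELECT * FROM big_table;

-- Plan plus execution: the query runs to completion on the server and
-- each plan node additionally reports "actual time" in milliseconds.
-- The result rows are discarded rather than sent to the client.
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM big_table;
```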

When it's finished, EXPLAIN ANALYZE sends the plan and its timings back to the client and discards the rows of the actual result. That differs from a real SELECT, which has to send the rows back to the client.

For this reason, the big difference between the two operations in your case is most likely accounted for by slowness in receiving the results, either because the network is slow, or because the client is slow, or both. Maybe the client is swapping like crazy because the result set is too big to fit in the available RAM. I'd look at the vmstat 1 and iftop outputs (or even strace) on server and client during those 2 minutes to check what's doing what.
