Postgresql – Distinct result with row_id for limit query

limitsoffset-fetchpostgresqlsubquery

There are 2 tables:

CREATE TABLE sample1 (
  id int,
  name varchar,
  score int,
  mode int
)

CREATE TABLE sample2 (
  id int,
  mode int
  ...
)

I need to select distinct name with MIN(score) and ROW_NUMBER(). Must sort by ROW_NUMBER() ASC. The mode must exist in table sample2 for a row from sample1, so there is INNER JOIN.

ROW_NUMBER() is needed for sequential LIMIT queries so I can select data in batches. I don't care about consistency of that LIMIT+ROW_NUMBER() trick.

My query:

WITH sample1 AS (
    WITH sample1 AS (
        SELECT a.name, MIN(a.score) s
        FROM sample1 a
            JOIN sample2 USING (mode)
            GROUP BY a.name
    )
    SELECT *, ROW_NUMBER() OVER (ORDER BY s ASC) AS n
    FROM sample1
) SELECT * FROM sample1 WHERE n > $1 LIMIT $2

I am concerned about efficiency of that select, in fact it's 3 level deep subquery. Perhaps there are less ugly and more efficient ways to achieve same result?

The number of rows in this table will be approximately 1,000,000 and distinct set is expected to be 100,000. Queries are frequent, actually it's a hot table for select, only score column will be updated in batches of the same size as limit select.

Best Answer

Would seem like you do not need any CTE, subquery or even JOIN at all. Just an EXISTS semi-join and an additional OFFSET:

SELECT name
     , MIN(score) AS s
     , ROW_NUMBER() OVER (ORDER BY MIN(score)) AS n  -- still needed?
FROM   sample1 a
WHERE  EXISTS (SELECT FROM sample2 s2 WHERE s2.mode = a.mode)
GROUP  BY name
ORDER  BY s
OFFSET $1
LIMIT  $2;

A large OFFSET is always costly, as all the leading, discarded rows have to be considered in the ordering first.

If it's a read-only table, you could add a MATERIALIZED VIEW with an index on the row number and query with SELECT * FROM mv WHERE n > $1 AND n <= $1 + $2.
Would help most for big $1 and small $2.