PostgreSQL Top-K with Ties – How to Implement

postgresqlpostgresql-9.3top

Assume I have the following DB table with 3 integer fields.

A | B | C
1 | 2 | 3
1 | 2 | 4
1 | 3 | 1
2 | 4 | 2
2 | 4 | 3

When I do:

SELECT * FROM dbTable ORDER BY A,B LIMIT 1;

I get

1 | 2 | 3

which is expected. But the 2nd record:

 1 | 2 | 4

also has the same values for dbFields A and B. Is there any efficient way to actually retrieve all records that have the same value as the top-k records? For example, when I search for the first 100 records to get instead 102 records if the last two have the same values as the 100th record? Is there any index to accelerate such queries? I do not mind if it has to be done with pl/pgsql (and not plain SQL) if the implementation is efficient.

Best Answer

You can use a window function for this:

select a,b,c
from (
  select a,b,c,
         dense_rank() over (order by a,b) as rnk
  from dbTable
) t
where rnk = 1;

For the "first" rows, it doesn't matter if you use rank() or dense_rank(). When you e.g. want the "second" ones, the rank() and dense_rank() would return different results in case of ties. Because rank() will have "gaps" in the numbers, but dense_rank() will not.


A possible speedup might be achieved by doing this in two steps and of course having an index on (a,b)

with ranked as (
  select *
  from (
    select a,b,
           dense_rank() over (order by a,b) as rnk
    from dbTable
  ) t
  where r.rnk = 1  -- (or <= for "top-k")
)
select t.a, t.b, t.c
from dbTable t
   join ranked r on r.a = t.a and r.b = t.b;

The idea is to give Postgres the chance to do an index-only scan for the ranking part and then join only the result of that to the base table to get the remaining column(s). The filtering on the rank is done inside the CTE as Postgres doesn't push down conditions from the outer query into the CTE itself (that's why I have the derived table inside the CTE)

I'm not sure if this really improves the performance, but I guess it would be worth trying and have a look a the execution plan with the real tables (and data).