PostgreSQL Top-K with Ties – How to Implement

postgresqlpostgresql-9.3top

Assume I have the following DB table with 3 integer fields.

A | B | C
1 | 2 | 3
1 | 2 | 4
1 | 3 | 1
2 | 4 | 2
2 | 4 | 3

When I do:

SELECT * FROM dbTable ORDER BY A,B LIMIT 1;

I get

1 | 2 | 3

which is expected. But the 2nd record:

 1 | 2 | 4

also has the same values for dbFields A and B. Is there any efficient way to actually retrieve all records that have the same value as the top-k records? For example, when I search for the first 100 records to get instead 102 records if the last two have the same values as the 100th record? Is there any index to accelerate such queries? I do not mind if it has to be done with pl/pgsql (and not plain SQL) if the implementation is efficient.

Best Answer

You can use a window function for this:

select a,b,c
from (
  select a,b,c,
         dense_rank() over (order by a,b) as rnk
  from dbTable
) t
where rnk = 1;

For the "first" rows, it doesn't matter if you use rank() or dense_rank(). When you e.g. want the "second" ones, the rank() and dense_rank() would return different results in case of ties. Because rank() will have "gaps" in the numbers, but dense_rank() will not.

A possible speedup might be achieved by doing this in two steps and of course having an index on (a,b)

with ranked as (
  select *
  from (
    select a,b,
           dense_rank() over (order by a,b) as rnk
    from dbTable
  ) t
  where r.rnk = 1  -- (or <= for "top-k")
)
select t.a, t.b, t.c
from dbTable t
   join ranked r on r.a = t.a and r.b = t.b;

The idea is to give Postgres the chance to do an index-only scan for the ranking part and then join only the result of that to the base table to get the remaining column(s). The filtering on the rank is done inside the CTE as Postgres doesn't push down conditions from the outer query into the CTE itself (that's why I have the derived table inside the CTE)

I'm not sure if this really improves the performance, but I guess it would be worth trying and have a look a the execution plan with the real tables (and data).

Related Solutions

PostgreSQL – Levenshtein Distance and Nested Queries

I suggest to use a trigram index provided by the additional module pg_trgm. Combine that with the length of the string to get a valid pre-selection.

Drop the columns T4,T16 and T64 (faster in a single statement), and run VACUUM FULL or CLUSTER.
Install pg_trgm. Details here:
- Full Text Search With PostgreSQL
Create a GiST index on tl and length(tl).

There are a details to be observed. Related answers on dba.SE involving pg_trgm:

https://dba.stackexchange.com/search?q=pg_trgm

Postgresql – CRUD operations on json data

Matheus de Oliveira created handy functions for JSON CRUD operations in postgresql. They can be imported using the \i directive. Notice the jsonb fork of the functions if jsonb is your data type.

9.3 json https://gist.github.com/matheusoliveira/9488951

(and while jsonb is not an option for Michael, for those who are using 9.4, there is a jsonb variant: https://gist.github.com/inindev/2219dff96851928c2282)

Operations Provided:

postgres=# SELECT json_append('{"a": 1}', '{"b": 2, "c": 3, "a": 4}');
        json_append        
---------------------------
 {"a":1,"b":2,"c":3,"a":4}
(1 row)

postgres=# SELECT json_delete('{"b": 2, "c": 3, "a": 4}', '{b, c}');
 json_delete 
-------------
 {"a":4}
(1 row)

postgres=# SELECT json_update('{"a": 1}', '{"b": 2, "c": 3, "a": 4}');
 json_update 
-------------
 {"a":4}
(1 row)

postgres=# SELECT json_merge('{"a": 1}', '{"b": 2, "c": 3, "a": 4}');
     json_merge      
---------------------
 {"b":2,"c":3,"a":4}
(1 row)

Best Answer

Related Solutions

PostgreSQL – Levenshtein Distance and Nested Queries

Postgresql – CRUD operations on json data

Related Question