PostgreSQL – How to Speed Up DISTINCT Query?

Tags: distinct, performance, postgresql, postgresql-10, query-performance

I have a table t in a database (PostgreSQL 10.4):

\d t;
                Table "public.t"
  Column  |          Type          | Collation | Nullable | Default 
----------+------------------------+-----------+----------+---------
 sn       | character varying(11)  |           |          | 
 site     | character varying(50)  |           |          | 
Indexes:
    "site_2018_idx" btree (site), tablespace "indexspace"
    "sn_2018_idx" btree (sn), tablespace "indexspace"

I need to find the distinct sn values for a specific site, which I do like this:

SELECT DISTINCT sn FROM t WHERE site='a_b301_1' ORDER BY sn ;

It works, but it is very slow: returning 75 distinct sn values takes about 8 minutes!
Is there a way to speed it up?
EXPLAIN ANALYZE gives this output:

QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=42873094.21..42873103.25 rows=3615 width=12) (actual time=190431.413..190431.417 rows=75 loops=1)
   Output: sn
   Sort Key: t.sn
   Sort Method: quicksort  Memory: 28kB
   ->  HashAggregate  (cost=42872844.42..42872880.57 rows=3615 width=12) (actual time=190431.233..190431.263 rows=75 loops=1)
         Output: sn
         Group Key: t.sn
         ->  Bitmap Heap Scan on public.t  (cost=874850.36..42695793.24 rows=70820471 width=12) (actual time=8755.163..168773.143 rows=43096912 loops=1)
               Output: sn, site
               Recheck Cond: ((t.site)::text = 'a_b301_1'::text)
               Heap Blocks: exact=783666
               ->  Bitmap Index Scan on site_2018_idx  (cost=0.00..857145.24 rows=70820471 width=0) (actual time=8540.835..8540.835 rows=43096912 loops=1)
                     Index Cond: ((t.site)::text = 'a_b301_1'::text)
 Planning time: 0.466 ms
 Execution time: 190433.289 ms
(15 rows)

Additional information

After I created another index on (site, sn), as recommended, the time dropped significantly, from 8 minutes to 30 seconds. That is great, but I don't understand why this is the case. How is one multi-column index better than two separate indexes here?

Best Answer

As suggested, an index on (site, sn) could speed this up, especially if the table is well-vacuumed so that it benefits from index-only scans.
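For reference, creating such a composite index and keeping the table vacuumed might look like this (the index name is illustrative, not from the question):

    -- Composite index so a (site, sn) lookup can be answered from the index alone
    CREATE INDEX site_sn_2018_idx ON t (site, sn);

    -- Keep the visibility map current so the planner can actually
    -- choose index-only scans instead of visiting the heap
    VACUUM ANALYZE t;

This answers the follow-up question above: with two separate indexes, PostgreSQL must fetch every matching heap row for a site and then deduplicate 43 million sn values; with a composite index, all matching (site, sn) pairs are adjacent in one index and, if the visibility map is fresh, the heap is never touched at all.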

If that is not enough of a speed up, then there is a way to use this index in a "skip scan" or "Loose Index Scan" to speed up queries where the number of distinct values is much lower than the size of the index. Unfortunately PostgreSQL's planner does not automatically detect opportunities for this, but you can force it yourself by writing a recursive common table expression. You can find a discussion of this technique on the PostgreSQL wiki, although you will need to adapt it to your parameterized version.

The resulting queries are quite ugly, so you might want to wrap them into a view. Or in this case, since it is parameterized, into a set-returning function. Something like this works for me:

CREATE FUNCTION public.foobar(text) RETURNS SETOF text
    LANGUAGE sql
    AS $_$
WITH RECURSIVE r AS (
  -- seed: the smallest sn for the given site
  (SELECT sn FROM t WHERE site = $1 ORDER BY sn LIMIT 1)
  UNION ALL
  -- step: the smallest sn greater than the previous one
  SELECT (SELECT t.sn FROM t WHERE t.site = $1 AND t.sn > r.sn ORDER BY t.sn LIMIT 1)
  FROM r WHERE r.sn IS NOT NULL
)
SELECT * FROM r WHERE r.sn IS NOT NULL
$_$;
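The function is then called in place of the original DISTINCT query, for example with the site value from the question:

    SELECT * FROM public.foobar('a_b301_1');

Each recursive step does a single index probe (via the ORDER BY ... LIMIT 1 on the composite index), so the total cost scales with the number of distinct sn values (75) rather than the number of matching rows (43 million).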

Setting up the table and index with:

create table t as select 
    floor(random()*1.2)::int::varchar as site, 
    (-floor(50*log(random()))::int)::varchar as sn 
from generate_series(1,10000000);

create index on t (site, sn);
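With this test data (the generator above produces site values '0' and '1'), you can compare the plans for the plain DISTINCT query and the recursive function yourself:

    EXPLAIN ANALYZE SELECT DISTINCT sn FROM t WHERE site = '0' ORDER BY sn;
    EXPLAIN ANALYZE SELECT * FROM public.foobar('0');

The first plan has to aggregate over every matching row, while the second shows one cheap index probe per distinct value.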