I have a table t in a database (PostgreSQL 10.4):
\d t;
Table "public.t"
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------
sn | character varying(11) | | |
site | character varying(50) | | |
Indexes:
"site_2018_idx" btree (site), tablespace "indexspace"
"sn_2018_idx" btree (sn), tablespace "indexspace"
I need to find distinct 'sn's for specific site, and I do it like this:
SELECT DISTINCT sn FROM t WHERE site='a_b301_1' ORDER BY sn ;
It works, but is very slow, to return 75 distinct 'sn' values it takes about 8 minutes!
Is there a way to speed it up?
The explain analyze gives this output:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=42873094.21..42873103.25 rows=3615 width=12) (actual time=190431.413..190431.417 rows=75 loops=1)
Output: sn
Sort Key: t.sn
Sort Method: quicksort Memory: 28kB
-> HashAggregate (cost=42872844.42..42872880.57 rows=3615 width=12) (actual time=190431.233..190431.263 rows=75 loops=1)
Output: sn
Group Key: t.sn
-> Bitmap Heap Scan on public.t (cost=874850.36..42695793.24 rows=70820471 width=12) (actual time=8755.163..168773.143 rows=43096912 loops=1)
Output: sn, site
Recheck Cond: ((t.site)::text = 'a_b301_1'::text)
Heap Blocks: exact=783666
-> Bitmap Index Scan on site_2018_idx (cost=0.00..857145.24 rows=70820471 width=0) (actual time=8540.835..8540.835 rows=43096912 loops=1)
Index Cond: ((t.site)::text = 'a_b301_1'::text)
Planning time: 0.466 ms
Execution time: 190433.289 ms
(15 rows)
Additional information
After I created another index, as was recommended (site, sn)
, the time decreased significantly, from 8 min to 30 sec. It is great, I just don't understand why is this the case? How is one multi-column index better than two separate indexes, in this case?
Best Answer
As suggested, an index on
(site, sn)
could speed this up, especially if the table is well-vacuumed so that it benefits from index-only scans.If that is not enough of a speed up, then there is a way to use this index in a "skip scan" or "Loose Index Scan" to speed up queries where the number of distinct values is much lower than the size of the index. Unfortunately PostgreSQL's planner does not automatically detect opportunities for this, but you can force it yourself by writing a recursive common table expression. You can find a discussion of this technique on the PostgreSQL wiki, although you will need to adapt it to your parameterized version.
The resulting queries are quite ugly, so you might want to wrap them into a view. Or in this case, since it is parameterized, into a set-returning function. Something like this works for me:
Setting up the table and index with: