I have a PG table with about 30M records. I'm noticing that when I query the table with IN
clause that would fetch lot of rows, the query is taking a long time to respond (~3 minutes) Question Although my query is using indices, how can I make the query faster?
mydb=> explain analyze select count(scores), sum(scores) from mytable
where classes in ('1st','2nd','3rd') and school_year in ('2018');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1150864.84..1150864.85 rows=1 width=40) (actual time=185454.773..185454.773 rows=1 loops=1)
-> Bitmap Heap Scan on mytable (cost=16998.63..1146247.22 rows=923525 width=6) (actual time=934.546..184704.246 rows=967663 loops=1)
Recheck Cond: ((classes)::text = ANY ('{1st,2nd,3rd}'::text[]))
Filter: (school_year = 2018)
Heap Blocks: exact=448386
-> Bitmap Index Scan on mytable_classes_idx (cost=0.00..16767.75 rows=923525 width=0) (actual time=736.690..736.690 rows=967663 loops=1)
Index Cond: ((classes)::text = ANY ('{1st,2nd,3rd}'::text[]))
Planning time: 1.252 ms
Execution time: 185463.153 ms
(9 rows)
Additional Details
version
--------------------------------------------------------------------------------------------------------
PostgreSQL 10.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
(1 row)
Best Answer
One odd thing is that the filter
Filter: (school_year = 2018)
doesn't remove any rows. So either the entire table hasschool_year = 2018
, or all the rows which meet the criteriaclasses in ('1st','2nd','3rd')
haveschool_year = 2018
. Either way, it seems a bit odd. Is this how your real problem is, or is it an artefact of creating a smaller demonstration problem?The slowness probably comes from the IO needed to read a large amount of the table. Doing the
EXPLAIN (ANALYZE, BUFFERS)
, especially with "track_io_timing" turned on, would be especially helpful to figure that out.If so, creating an index that includes all of the columns, like:
should make this much faster. It will allow an index-only scan, instead of an scan which still needs to visit the table for each row. But for this to be effective, the table must be kept well-vacuumed. The default settings of autovac will probably not be enough.