Postgresql – How to create an index to improve this query

postgresql

I am trying to improve the speed of the following query:

SELECT COUNT(*) AS count, 
       SUM(amount) AS total, 
       AVG(case when service_years >= 30 then amount end) AS average, 
       employer 
FROM pensions_pension 
WHERE agency_id = 4 
GROUP BY employer 
ORDER BY count DESC NULLS LAST;

Table structure:

id: integer (primary key)
name: text
slug: varchar(50)
title: text
employer: text
pension: numeric(10,2)
benefits: numeric(10,2) 
disability: numeric(10,2)
amount: numeric(10,2)
service_years: numeric(10,2)
retirement_year: smallint
year: smallint
agency_id: integer (foreign key)
rank: smallint
note_text: text

Existing index definitions:

CREATE UNIQUE INDEX pensions_pension_pkey ON public.pensions_pension USING btree (id)
CREATE INDEX employer ON public.pensions_pension USING btree (employer)
CREATE INDEX pensions_pension_agency_id ON public.pensions_pension USING btree (agency_id)
CREATE INDEX pensions_pension_amount ON public.pensions_pension USING btree (amount DESC)
CREATE INDEX pensions_pension_employer ON public.pensions_pension USING btree (employer)
CREATE INDEX pensions_pension_rank_idx ON public.pensions_pension USING btree (rank, amount DESC)
CREATE INDEX pensions_pension_slug ON public.pensions_pension USING btree (slug)
CREATE INDEX pensions_pension_slug_like ON public.pensions_pension USING btree (slug varchar_pattern_ops)
CREATE INDEX pensions_pension_year ON public.pensions_pension USING btree (year)
CREATE INDEX year_amount ON public.pensions_pension USING btree (year DESC, amount DESC)
CREATE INDEX pensions_pension_year_29d1e58f8ad8ed60_uniq ON public.pensions_pension USING btree (year)
CREATE INDEX idx_pensions_pension_year_agency_id ON public.pensions_pension USING btree (year, agency_id) WHERE ((year <= 2016) AND (year IS NULL))
CREATE INDEX pensions_pension_idx_agency_id ON public.pensions_pension USING btree (agency_id)
CREATE INDEX pensions_pension_partial_agency_id__index ON public.pensions_pension USING btree (agency_id, employer) WHERE (agency_id = 4)

This is the query plan when I run ANALYZE EXPLAIN in postgresql:

Sort  (cost=215259.09..215266.48 rows=2956 width=96) (actual time=9976.097..9976.672 rows=3527 loops=1)
  Sort Key: (count(*)) DESC NULLS LAST
  Sort Method: quicksort  Memory: 551kB
  ->  HashAggregate  (cost=215044.35..215088.69 rows=2956 width=96) (actual time=9968.124..9973.413 rows=3527 loops=1)
        Group Key: employer
        ->  Seq Scan on pensions_pension  (cost=0.00..178968.04 rows=2886105 width=38) (actual time=0.054..5236.262 rows=2942951 loops=1)
              Filter: (agency_id = 4)
              Rows Removed by Filter: 2777772
Planning time: 0.443 ms
Execution time: 9983.167 ms

I tried to add an index on agency_id. I also tried creating a partial index on agency_id by setting it to agency_id=4. But the query planner keeps choosing a seq scan instead of an index scan. I'm fairly new to creating indexes and would like to learn about the process and how I can speed up something like this. I am running Postgresql 9.6.

Best Answer

There is no way to test this without your actual data. Theoretically, your optimal index for this query will be on:

agency_id, employer, amount, service_years

This index should allow immediate push-down filtering on agency_id, followed by a stream aggregate instead of the hash aggregate by employer, and the last 2 are just to cover the query, and avoid lookups to the main table to fetch amount and service_years which the query needs for the aggregates.

Generally speaking, when designing indexes, you should consider more than one query, and see if you can find common query patterns that will allow you to create near optimal indexes that will serve multiple types of queries. The way too common indexing strategy of 'Let's create an optimal index for every painful query' tends to lead to over-indexed databases, which will pose a set of challenges of its own, including modification overhead, and over-sized optimizer search trees.

Let us know if the optimizer uses this index efficiently.