Postgresql – How to optimize this query on Postgres

optimizationperformancepostgresqlquery-performance

I'm trying to make a relatively simple query, but it's taking much longer than I'd expect. I have an index in place, but it doesn't seem to be helping much.

Here's the query. It sometimes takes up to 20-30 seconds to execute:

SELECT "schedules".*
FROM "schedules"
WHERE "schedules"."client_data_provider_id" = '3001-753'
AND "schedules"."practice_id" = 753
AND (date_scheduled >= '2016-01-12')

Here's the table, with indices:

                                          Table "public.schedules"
        Column              |            Type             |                       Modifiers
----------------------------+-----------------------------+--------------------------------------------------------
 id                         | integer                     | not null default nextval('schedules_id_seq'::regclass)
 practice_id                | integer                     | not null
 data_provider_id           | character varying(255)      | not null
 source                     | character varying(255)      |
 type                       | character varying(255)      |
 client_data_provider_id    | character varying(255)      | not null
 client_pms_id              | character varying(255)      | not null
 patient_data_provider_id   | character varying(255)      | not null
 patient_pms_id             | character varying(255)      | not null
 date_scheduled             | date                        | not null
 duration                   | integer                     |
 status                     | character varying(255)      |
 reason                     | text                        |
 notes                      | text                        |
 resource_id                | character varying(255)      |
 resource_name              | character varying(255)      |
 site_id                    | integer                     |
 api_create_date            | timestamp without time zone |
 api_last_change_date       | timestamp without time zone |
 api_removed_date           | timestamp without time zone |
Indexes:
    "schedules_pkey" PRIMARY KEY, btree (id)
    "index_schedules_change_date_for_query" btree (practice_id, api_last_change_date DESC) WHERE api_last_change_date IS NOT NULL
    "index_schedules_create_date_for_query" btree (practice_id, api_create_date DESC) WHERE api_create_date IS NOT NULL
    "index_schedules_on_api_last_change_date" btree (api_last_change_date)
    "index_schedules_on_client_data_provider_id_and_date_for_query" btree (practice_id, date_scheduled, client_data_provider_id)
    "index_schedules_on_practice_id" btree (practice_id)
    "index_schedules_on_practice_id_and_date_scheduled" btree (practice_id, date_scheduled)
    "index_schedules_on_data_provider_id" btree (data_provider_id)

Here's the results of explain on the query:

=> EXPLAIN for: SELECT "schedules".* FROM "schedules"  WHERE "schedules"."client_data_provider_id" = '3001-753' AND "schedules"."practice_id" = 753 AND (date_scheduled >= '2016-01-12')
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_schedules_on_client_data_provider_id_and_date_for_query on schedules  (cost=0.08..14.49 rows=1 width=706)
   Index Cond: ((practice_id = 753) AND (date_scheduled >= '2016-01-12'::date) AND ((client_data_provider_id)::text = '3001-753'::text))
(2 rows)

The schedules table has almost 4.5 million records. Only about 120,000 records have practice_id = 466, and 55 records with client_data_provider_id = '3001-753'

As you can see, there are multiple indices on the table, including one that I thought would work specifically for this query (index_schedules_on_client_data_provider_id_and_date_for_query).

We're aiming for <1 sec response. 20-30 is way too long. How would I go about improving the performance of this query?

Best Answer

An index on (practice_id, client_data_provider_id, date_scheduled) would be better than the one you have (practice_id, date_scheduled, client_data_provider_id), for this particular query.

Notice the difference in order. When there are multiple equality (=) conditions and one range condition (>=, >, between, etc) in the where clause, it's better to have an index with first the columns that are checked with equality and last the column in the range condition.

This way, the index scan will have to cover a much smaller part of the index, only the values with practice_id = 753 and client_data_provider_id = '3001-753' and date_scheduled >= '2016-01-12', which is exactly the rows you want.

With the current index, it will have to scan the part of the index with practice_id = 753 and date_scheduled >= '2016-01-12' and then reject the largest part (all those that don't have client_data_provider_id = '3001-753').