Postgresql – How to optimize this query on Postgres

optimization performance postgresql query-performance

I'm trying to make a relatively simple query, but it's taking much longer than I'd expect. I have an index in place, but it doesn't seem to be helping much.

Here's the query. It sometimes takes up to 20-30 seconds to execute:

SELECT "schedules".*
FROM "schedules"
WHERE "schedules"."client_data_provider_id" = '3001-753'
AND "schedules"."practice_id" = 753
AND (date_scheduled >= '2016-01-12')

Here's the table, with indices:

                                          Table "public.schedules"
        Column              |            Type             |                       Modifiers
----------------------------+-----------------------------+--------------------------------------------------------
 id                         | integer                     | not null default nextval('schedules_id_seq'::regclass)
 practice_id                | integer                     | not null
 data_provider_id           | character varying(255)      | not null
 source                     | character varying(255)      |
 type                       | character varying(255)      |
 client_data_provider_id    | character varying(255)      | not null
 client_pms_id              | character varying(255)      | not null
 patient_data_provider_id   | character varying(255)      | not null
 patient_pms_id             | character varying(255)      | not null
 date_scheduled             | date                        | not null
 duration                   | integer                     |
 status                     | character varying(255)      |
 reason                     | text                        |
 notes                      | text                        |
 resource_id                | character varying(255)      |
 resource_name              | character varying(255)      |
 site_id                    | integer                     |
 api_create_date            | timestamp without time zone |
 api_last_change_date       | timestamp without time zone |
 api_removed_date           | timestamp without time zone |
Indexes:
    "schedules_pkey" PRIMARY KEY, btree (id)
    "index_schedules_change_date_for_query" btree (practice_id, api_last_change_date DESC) WHERE api_last_change_date IS NOT NULL
    "index_schedules_create_date_for_query" btree (practice_id, api_create_date DESC) WHERE api_create_date IS NOT NULL
    "index_schedules_on_api_last_change_date" btree (api_last_change_date)
    "index_schedules_on_client_data_provider_id_and_date_for_query" btree (practice_id, date_scheduled, client_data_provider_id)
    "index_schedules_on_practice_id" btree (practice_id)
    "index_schedules_on_practice_id_and_date_scheduled" btree (practice_id, date_scheduled)
    "index_schedules_on_data_provider_id" btree (data_provider_id)

Here are the results of EXPLAIN on the query:

=> EXPLAIN for: SELECT "schedules".* FROM "schedules"  WHERE "schedules"."client_data_provider_id" = '3001-753' AND "schedules"."practice_id" = 753 AND (date_scheduled >= '2016-01-12')
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_schedules_on_client_data_provider_id_and_date_for_query on schedules  (cost=0.08..14.49 rows=1 width=706)
   Index Cond: ((practice_id = 753) AND (date_scheduled >= '2016-01-12'::date) AND ((client_data_provider_id)::text = '3001-753'::text))
(2 rows)

The schedules table has almost 4.5 million records. Only about 120,000 records have practice_id = 466, and 55 records have client_data_provider_id = '3001-753'.

As you can see, there are multiple indices on the table, including one that I thought would work specifically for this query (index_schedules_on_client_data_provider_id_and_date_for_query).

We're aiming for a response time under 1 second; 20-30 seconds is way too long. How would I go about improving the performance of this query?

Best Answer

An index on (practice_id, client_data_provider_id, date_scheduled) would be better than the one you have (practice_id, date_scheduled, client_data_provider_id), for this particular query.

Notice the difference in order. When there are multiple equality (=) conditions and one range condition (>=, >, BETWEEN, etc.) in the WHERE clause, it's better to put the columns checked with equality first in the index and the column in the range condition last.

This way, the index scan will have to cover a much smaller part of the index, only the values with practice_id = 753 and client_data_provider_id = '3001-753' and date_scheduled >= '2016-01-12', which is exactly the rows you want.

With the current index, it will have to scan the part of the index with practice_id = 753 and date_scheduled >= '2016-01-12' and then reject the largest part (all those that don't have client_data_provider_id = '3001-753').
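
As a minimal sketch of the reordered index described above: the index name below is made up for illustration, and CONCURRENTLY is an optional choice that avoids blocking writes while the index builds (it cannot run inside a transaction block). Running the original query under EXPLAIN (ANALYZE, BUFFERS) afterwards should show whether the planner picks the new index and how long the query actually takes.

```sql
-- Hypothetical index name; the column order follows the answer above:
-- equality columns first, the range column last.
CREATE INDEX CONCURRENTLY index_schedules_on_practice_client_and_date
    ON schedules (practice_id, client_data_provider_id, date_scheduled);

-- Re-check the plan and the real execution time with the new index in place.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "schedules".*
FROM   "schedules"
WHERE  "schedules"."client_data_provider_id" = '3001-753'
AND    "schedules"."practice_id" = 753
AND    date_scheduled >= '2016-01-12';
```

If the new index serves this query well, the older index on (practice_id, date_scheduled, client_data_provider_id) may be redundant, though that depends on which other queries rely on it.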

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.

If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete: it needs to compute the whole table before it can sort and pick 100 rows.
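
One way the materialized view behind that query might be built is sketched below. It assumes PostgreSQL 9.3 or later (for CREATE MATERIALIZED VIEW). The customer and site tables, their columns, and the rule "pick the site with the lowest site_id" are all assumptions for illustration, since the question does not define which site should be shown.

```sql
-- Assumed schema (not from the original question):
--   customer(customer_id, name), site(site_id, customer_id)
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY name, customer_id) AS mv_id, *
FROM  (
   -- One row per customer; the lowest site_id is an arbitrary choice.
   SELECT DISTINCT ON (c.customer_id)
          c.customer_id, c.name, s.site_id
   FROM   customer c
   LEFT   JOIN site s USING (customer_id)
   ORDER  BY c.customer_id, s.site_id
   ) sub;

-- The index that makes range lookups on mv_id cheap.
CREATE UNIQUE INDEX ON materialized_view (mv_id);

-- Recompute the contents (and the gapless numbering) when the data changes.
REFRESH MATERIALIZED VIEW materialized_view;
```

The numbering is computed in an outer query because window functions are evaluated before DISTINCT ON, so numbering in the inner query would leave gaps.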

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
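
For illustration, the same server-side measurement can be requested from any client by prefixing the statement with EXPLAIN ANALYZE. The sketch below reuses the materialized_view query from earlier in this answer. Note that EXPLAIN ANALYZE really executes the statement, and that newer PostgreSQL releases label the final line as execution time rather than "Total runtime".

```sql
-- Runs the query on the server and reports the plan with actual timings;
-- the result rows are discarded, so client transfer time is not included.
EXPLAIN ANALYZE
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;
```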

Postgresql – Improve performance on concurrent UPDATEs for a timestamp column in Postgres

Would a different kind of column be faster? For example an integer

No. timestamp and timestamptz are just 64-bit integers internally anyway.

Is there some way to not lock the column?

It doesn't lock the column. It takes a weak table lock that doesn't really block anything except DDL, and takes a row-level lock on the row you're updating.

There is no way to prevent the row-level lock. It exists because without it the behaviour and ordering of concurrent updates would be undefined. We don't like undefined behaviour in RDBMSs.

It only blocks concurrent updates of the same row anyway.

Any other tips to improve this?

Not with the detail provided. There's likely a better way to do what you're trying to do, but it'll probably involve taking a few steps back and looking for a different strategy for solving the underlying problem.

In the specific case of cache invalidation I think you might want to look into LISTEN and NOTIFY. Again though, there just isn't enough info here to go on.
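
A minimal sketch of what LISTEN and NOTIFY look like in practice follows. The channel name cache_invalidation and the payload format are invented for illustration, and payload strings assume PostgreSQL 9.0 or later. Whether this fits depends on the details the question leaves out.

```sql
-- Session A (the process holding the cache) subscribes to a channel.
LISTEN cache_invalidation;

-- Session B signals after changing data. The notification is delivered
-- to listeners only when the sending transaction commits.
BEGIN;
-- ... data changes go here ...
NOTIFY cache_invalidation, 'some_table:42';
COMMIT;

-- Function form, handy when the channel or payload is built dynamically.
SELECT pg_notify('cache_invalidation', 'some_table:' || 42::text);
```

The listening client receives the payload asynchronously through its driver, so it can invalidate a cache entry without every writer updating a shared timestamp row.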
