PostgreSql changes same query plan at runtime

execution-planpostgresqlpostgresql-performance

im having a really weird issue here. We have search query that receives multiple parameters and searches by multiple joins in different tables. The query works fine first few time we run it and then sometimes it would slow down terribly for example: From 200ms to 15000/20000ms. On the same exact query.

Initially i thought it was a backend ORM issue but after checking it thoroughly and turning on slow query log, turning on auto_explain feature i realized the query plan changes after few calls. We put a loop in backend that executes query on the same thread mulitple times(for testing) and i discovered that almost always after 8 executions db starts logging slow query. And finally it turned out query plan changes.
Most important change being(i think) on TimeSlots table, from seq scan to Bitmap index scan which estimates much fewer rows than there is in a table. estimate: 566 actual: 100 000 ish.

Things i tried beforehand:

  • Upgrading Hardware from 2core 6GB to 8 core 26gb ram (100GB SSD)
  • Disabling any other application(thus stopping any other load) from reaching it.
  • performing manual(there are also automated jobs of them running) VACUUM, ANALYZE, REINDEX-ing on the tables.
  • Setting statistics for various columns, such as SlotDateTime (to 10000)
  • Creating ndistinct statistics.
  • Increased work_mem, maintenance_work_mem, shared_buffers size.
  • Decreased seq_page_cost to 1.

Here is the select statement:

SELECT DISTINCT u."UserId"     AS                                     UserId,
                             cast(cuj.company_id as bigint) AS                                     CompanyId,
                         cast(ts.SlotsCount as bigint) as SlotsCount,
                         case when cuj.vip is null then false else cuj.vip end vip,
                         u.is_vip       as                                     userVip,

                         case
                             when ts2.EarliestDate isnull then '7777-12-01 21:00:00.000000'
                             else ts2.EarliestDate end                         EarliestDate
         from "Users" u
             left JOIN "UserSpecialty" usp ON usp."UserId" = u."UserId"
             left join "company_users_join" cuj ON cuj.user_id = u."UserId"  and cuj.company_user_role =2 and cuj.status in(1,3)
             LEFT JOIN "company_users_join_providers" cujp
              ON cuj.id = cujp.company_users_join_id
             left join ( SELECT T."DoctorId", dse.company_user_join_id, COALESCE(COUNT(*),0)  as SlotsCount
                          FROM "TimeSlots" T
                                  inner JOIN
                               "DoctorSchedule" dse
                               ON T."DoctorScheduleId" = dse."DoctorScheduleId"
                          WHERE (T."SlotDateTime" > :dateStart
                              AND T."SlotDateTime" < :dateEnd)
                            and (:specialtyId = 0 or t."SpecialityId" = :specialtyId) 
                            and (:specialtyId = 0 or dse."SpecialityId" = :specialtyId)
                            AND (dse."EndDate" < T."SlotDateTime"
                              OR dse."EndDate" IS NULL)
                            AND T."DeleteDate" IS NULL
                            AND T."IsActive" = TRUE
                            and dse."DeleteDate" is null
                          GROUP BY T."DoctorId",dse.company_user_join_id) as ts
                 on ts."DoctorId" = u."UserId" and  ((ts.company_user_join_id is null and cuj.company_id is null) or ts.company_user_join_id = cuj.id) 
         left join (SELECT t1."DoctorId", dse.company_user_join_id, min(t1."SlotDateTime") as EarliestDate
               FROM "TimeSlots" t1
                       inner JOIN
                    "DoctorSchedule" dse
                    ON t1."DoctorScheduleId" = dse."DoctorScheduleId"
               WHERE (
                       t1."SlotDateTime" > :dateStart
                       AND t1."SlotDateTime" < :dateEnd
                   )
                 and (:specialtyId =0 or t1."SpecialityId" = :specialtyId) 
                 AND (
                       dse."EndDate" < t1."SlotDateTime"
                       OR dse."EndDate" IS NULL
                   )
                 and(:serviceTypeId = 0 or t1."ServiceTypeId" = :serviceTypeId ) 
                 AND t1."DeleteDate" IS NULL
                 AND dse."DeleteDate" is null
                 AND t1."IsActive" = TRUE
               GROUP BY t1."DoctorId", dse.company_user_join_id) ts2
              on ts2."DoctorId" = u."UserId" and  ((ts2.company_user_join_id is null and cuj.company_id is null) or ts2.company_user_join_id = cuj.id) 
            and(:companyId =0 or cuj.company_id = :companyId) 
           and(:providerId =0 or cujp.provider_id = :providerId)
           and(:specialtyId =0 or usp."SpecialtyId" = :specialtyId) 
           and (:onlyForeignDoctors = false or u.is_foreign_doctor = true)
           and (:onlyForeignDoctors = true or LOWER(u."FullName") = :fullName or (u.is_foreign_doctor is null or u.is_foreign_doctor = false))
           and  LOWER(u."FullName") LIKE concat('%', :fullName, '%')
           and ((length(:fullName) =0 and ts.SlotsCount >0) or (length(:fullName) >0 ))
        and ts.SlotsCount >=0
       ORDER BY  userVip desc, vip desc, EarliestDate asc, SlotsCount desc   offset 0 limit 20

Here is the query plan at the beggining(fast one):

Limit  (cost=9689.40..9689.42 rows=1 width=30) (actual time=127.650..127.682 rows=20 loops=1)
      Buffers: shared hit=2229 read=2013
      ->  Unique  (cost=9689.40..9689.42 rows=1 width=30) (actual time=127.639..127.668 rows=20 loops=1)
            Buffers: shared hit=2229 read=2013
            ->  Sort  (cost=9689.40..9689.40 rows=1 width=30) (actual time=127.637..127.653 rows=51 loops=1)
                  Sort Key: u.is_vip DESC, (CASE WHEN (cuj.vip IS NULL) THEN false ELSE cuj.vip END) DESC, (CASE WHEN ((min(t1."SlotDateTime")) IS NULL) THEN '7777-12-01 21:00:00'::timestamp without time zone ELSE (min(t1."SlotDateTime")) END), (COALESCE(count(*), '0'::bigint)) DESC, u."UserId", cuj.company_id
                  Sort Method: quicksort  Memory: 57kB
                  Buffers: shared hit=2229 read=2013
                  ->  Nested Loop Left Join  (cost=9297.95..9689.39 rows=1 width=30) (actual time=122.171..127.367 rows=419 loops=1)
                        Join Filter: (cuj.id = cujp.company_users_join_id)
                        Rows Removed by Join Filter: 26015
                        Buffers: shared hit=2215 read=2013
                        ->  Hash Right Join  (cost=9297.95..9686.68 rows=1 width=38) (actual time=122.107..122.421 rows=391 loops=1)
                              Hash Cond: (t1."DoctorId" = u."UserId")
                              Join Filter: (((dse.company_user_join_id IS NULL) AND (cuj.company_id IS NULL)) OR (dse.company_user_join_id = cuj.id))
                              Rows Removed by Join Filter: 26
                              Buffers: shared hit=1825 read=2012
                              ->  HashAggregate  (cost=4100.24..4263.91 rows=16367 width=20) (actual time=37.779..37.953 rows=133 loops=1)
                                    Group Key: t1."DoctorId", dse.company_user_join_id
                                    Batches: 1  Memory Usage: 801kB
                                    Buffers: shared hit=1268 read=53
                                    ->  Hash Join  (cost=853.30..3977.49 rows=16367 width=20) (actual time=3.710..28.964 rows=45298 loops=1)
                                          Hash Cond: (t1."DoctorScheduleId" = dse."DoctorScheduleId")
                                          Join Filter: ((dse."EndDate" < t1."SlotDateTime") OR (dse."EndDate" IS NULL))
                                          Buffers: shared hit=1268 read=53
                                          ->  Bitmap Heap Scan on "TimeSlots" t1  (cost=624.79..3638.82 rows=41926 width=16) (actual time=2.127..16.859 rows=45300 loops=1)
                                                Recheck Cond: ("ServiceTypeId" = 1)
                                                Filter: (("DeleteDate" IS NULL) AND "IsActive" AND ("SlotDateTime" > '2021-03-29 23:22:36.152'::timestamp without time zone) AND ("SlotDateTime" < '2021-04-28 23:22:36.152'::timestamp without time zone))
                                                Rows Removed by Filter: 8302
                                                Heap Blocks: exact=1157
                                                Buffers: shared hit=1157 read=53
                                                ->  Bitmap Index Scan on timeslots4  (cost=0.00..614.31 rows=53602 width=0) (actual time=1.960..1.961 rows=53602 loops=1)
                                                      Index Cond: ("ServiceTypeId" = 1)
                                                      Buffers: shared read=53
                                          ->  Hash  (cost=197.21..197.21 rows=2504 width=20) (actual time=1.537..1.538 rows=2504 loops=1)
                                                Buckets: 4096  Batches: 1  Memory Usage: 132kB
                                                Buffers: shared hit=111
                                                ->  Seq Scan on "DoctorSchedule" dse  (cost=0.00..197.21 rows=2504 width=20) (actual time=0.009..1.110 rows=2504 loops=1)
                                                      Filter: ("DeleteDate" IS NULL)
                                                      Rows Removed by Filter: 3897
                                                      Buffers: shared hit=111
                              ->  Hash  (cost=5197.70..5197.70 rows=1 width=30) (actual time=84.315..84.322 rows=391 loops=1)
                                    Buckets: 1024  Batches: 1  Memory Usage: 35kB
                                    Buffers: shared hit=557 read=1959
                                    ->  Nested Loop Left Join  (cost=4728.99..5197.70 rows=1 width=30) (actual time=83.475..84.219 rows=391 loops=1)
                                          Buffers: shared hit=557 read=1959
                                          ->  Nested Loop Left Join  (cost=4728.72..5197.30 rows=1 width=30) (actual time=83.461..83.934 rows=138 loops=1)
                                                Filter: (((dse_1.company_user_join_id IS NULL) AND (cuj.company_id IS NULL)) OR (dse_1.company_user_join_id = cuj.id))
                                                Rows Removed by Filter: 8
                                                Buffers: shared hit=280 read=1959
                                                ->  Hash Join  (cost=4728.57..5186.85 rows=39 width=21) (actual time=83.411..83.619 rows=139 loops=1)
                                                      Hash Cond: (t."DoctorId" = u."UserId")
                                                      Buffers: shared hit=2 read=1956
                                                      ->  HashAggregate  (cost=4668.48..5087.57 rows=3104 width=20) (actual time=81.842..82.001 rows=147 loops=1)
                                                            Group Key: t."DoctorId", dse_1.company_user_join_id
                                                            Filter: ((COALESCE(count(*), '0'::bigint) > 0) AND (COALESCE(count(*), '0'::bigint) >= 0))
                                                            Batches: 1  Memory Usage: 801kB
                                                            Buffers: shared hit=2 read=1839
                                                            ->  Hash Join  (cost=228.51..4236.18 rows=34584 width=12) (actual time=2.341..63.449 rows=95524 loops=1)
                                                                  Hash Cond: (t."DoctorScheduleId" = dse_1."DoctorScheduleId")
                                                                  Join Filter: ((dse_1."EndDate" < t."SlotDateTime") OR (dse_1."EndDate" IS NULL))
                                                                  Buffers: shared hit=2 read=1839
                                                                  ->  Seq Scan on "TimeSlots" t  (cost=0.00..3774.93 rows=88591 width=16) (actual time=0.018..38.614 rows=95531 loops=1)
                                                                        Filter: (("DeleteDate" IS NULL) AND "IsActive" AND ("SlotDateTime" > '2021-03-29 23:22:36.152'::timestamp without time zone) AND ("SlotDateTime" < '2021-04-28 23:22:36.152'::timestamp without time zone))
                                                                        Rows Removed by Filter: 17731
                                                                        Buffers: shared hit=2 read=1728
                                                                  ->  Hash  (cost=197.21..197.21 rows=2504 width=20) (actual time=2.293..2.294 rows=2504 loops=1)
                                                                        Buckets: 4096  Batches: 1  Memory Usage: 132kB
                                                                        Buffers: shared read=111
                                                                        ->  Seq Scan on "DoctorSchedule" dse_1  (cost=0.00..197.21 rows=2504 width=20) (actual time=0.012..1.782 rows=2504 loops=1)
                                                                              Filter: ("DeleteDate" IS NULL)
                                                                              Rows Removed by Filter: 3897
                                                                              Buffers: shared read=111
                                                      ->  Hash  (cost=57.25..57.25 rows=227 width=5) (actual time=1.560..1.561 rows=218 loops=1)
                                                            Buckets: 1024  Batches: 1  Memory Usage: 16kB
                                                            Buffers: shared read=117
                                                            ->  Index Scan using users__index_is_published on "Users" u  (cost=0.29..57.25 rows=227 width=5) (actual time=0.180..1.507 rows=218 loops=1)
                                                                  Index Cond: ("IsPublished" = true)
                                                                  Filter: (((lower(("FullName")::text) = ''::text) OR (is_foreign_doctor IS NULL) OR (NOT is_foreign_doctor)) AND (lower(("FullName")::text) ~~ concat('%', ''::character varying, '%')))
                                                                  Rows Removed by Filter: 9
                                                                  Buffers: shared read=117
                                                ->  Index Scan using company_users_join__index_sch1 on company_users_join cuj  (cost=0.14..0.26 rows=1 width=21) (actual time=0.002..0.002 rows=1 loops=139)
                                                      Index Cond: ((user_id = u."UserId") AND (company_user_role = 2))
                                                      Filter: (status = ANY ('{1,3}'::integer[]))
                                                      Buffers: shared hit=278 read=3
                                          ->  Index Only Scan using us1 on "UserSpecialty" usp  (cost=0.28..0.38 rows=2 width=4) (actual time=0.001..0.001 rows=3 loops=138)
                                                Index Cond: ("UserId" = u."UserId")
                                                Heap Fetches: 0
                                                Buffers: shared hit=277
                        ->  Seq Scan on company_users_join_providers cujp  (cost=0.00..1.87 rows=67 width=8) (actual time=0.001..0.006 rows=67 loops=391)
                              Buffers: shared hit=390 read=1

This is the slow one

Limit  (cost=2664.61..2664.63 rows=1 width=30) (actual time=377.322..377.348 rows=20 loops=1)
      Buffers: shared hit=383194
      ->  Unique  (cost=2664.61..2664.63 rows=1 width=30) (actual time=377.320..377.343 rows=20 loops=1)
            Buffers: shared hit=383194
            ->  Sort  (cost=2664.61..2664.62 rows=1 width=30) (actual time=377.319..377.330 rows=51 loops=1)
                  Sort Key: u.is_vip DESC, (CASE WHEN (cuj.vip IS NULL) THEN false ELSE cuj.vip END) DESC, (CASE WHEN ((min(t1."SlotDateTime")) IS NULL) THEN '7777-12-01 21:00:00'::timestamp without time zone ELSE (min(t1."SlotDateTime")) END), (COALESCE(count(*), '0'::bigint)) DESC, u."UserId", cuj.company_id
                  Sort Method: quicksort  Memory: 57kB
                  Buffers: shared hit=383194
                  ->  Merge Left Join  (cost=2629.35..2664.60 rows=1 width=30) (actual time=318.926..377.019 rows=419 loops=1)
                        Merge Cond: (u."UserId" = t1."DoctorId")
                        Join Filter: (((dse_1.company_user_join_id IS NULL) AND (cuj.company_id IS NULL)) OR (dse_1.company_user_join_id = cuj.id))
                        Rows Removed by Join Filter: 26
                        Buffers: shared hit=383194
                        ->  Nested Loop Left Join  (cost=1363.46..1398.63 rows=1 width=30) (actual time=206.475..254.659 rows=419 loops=1)
                              Join Filter: (cuj.id = cujp.company_users_join_id)
                              Rows Removed by Join Filter: 26015
                              Filter: (($15 = 0) OR (cujp.provider_id = $16))
                              Buffers: shared hit=290878
                              ->  Nested Loop Left Join  (cost=1363.46..1395.58 rows=1 width=30) (actual time=206.460..249.574 rows=391 loops=1)
                                    Join Filter: (cuj.user_id = u."UserId")
                                    Rows Removed by Join Filter: 89798
                                    Filter: ((((dse.company_user_join_id IS NULL) AND (cuj.company_id IS NULL)) OR (dse.company_user_join_id = cuj.id)) AND (($13 = 0) OR (cuj.company_id = $14)))
                                    Rows Removed by Filter: 37
                                    Buffers: shared hit=290487
                                    ->  Nested Loop Left Join  (cost=1363.46..1384.83 rows=1 width=21) (actual time=206.445..226.437 rows=394 loops=1)
                                          Filter: (($17 = 0) OR (usp."SpecialtyId" = $18))
                                          Buffers: shared hit=289305
                                          ->  Nested Loop  (cost=1359.17..1375.28 rows=1 width=21) (actual time=206.418..225.252 rows=139 loops=1)
                                                Buffers: shared hit=288749
                                                ->  GroupAggregate  (cost=1358.88..1358.93 rows=1 width=20) (actual time=206.366..224.117 rows=147 loops=1)
                                                      Group Key: t."DoctorId", dse.company_user_join_id
                                                      Filter: ((COALESCE(count(*), '0'::bigint) >= 0) AND (((length(($23)::text) = 0) AND (COALESCE(count(*), '0'::bigint) > 0)) OR (length(($24)::text) > 0)))
                                                      Buffers: shared hit=288308
                                                      ->  Sort  (cost=1358.88..1358.89 rows=1 width=12) (actual time=205.791..212.762 rows=95524 loops=1)
                                                            Sort Key: t."DoctorId", dse.company_user_join_id
                                                            Sort Method: quicksort  Memory: 7550kB
                                                            Buffers: shared hit=288308
                                                            ->  Nested Loop  (cost=10.24..1358.87 rows=1 width=12) (actual time=3.708..181.475 rows=95524 loops=1)
                                                                  Buffers: shared hit=288308
                                                                  ->  Bitmap Heap Scan on "TimeSlots" t  (cost=9.96..1245.79 rows=16 width=16) (actual time=3.687..31.267 rows=95531 loops=1)
                                                                        Recheck Cond: (("SlotDateTime" > $1) AND ("SlotDateTime" < $2))
                                                                        Filter: (("DeleteDate" IS NULL) AND "IsActive" AND (($3 = 0) OR ("SpecialityId" = $4)))
                                                                        Rows Removed by Filter: 389
                                                                        Heap Blocks: exact=1627
                                                                        Buffers: shared hit=1715
                                                                        ->  Bitmap Index Scan on timeslots__index_slotdatetime  (cost=0.00..9.95 rows=566 width=0) (actual time=3.468..3.469 rows=95920 loops=1)
                                                                              Index Cond: (("SlotDateTime" > $1) AND ("SlotDateTime" < $2))
                                                                              Buffers: shared hit=88
                                                                  ->  Index Scan using sch1 on "DoctorSchedule" dse  (cost=0.28..7.06 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=95531)
                                                                        Index Cond: ("DoctorScheduleId" = t."DoctorScheduleId")
                                                                        Filter: (("DeleteDate" IS NULL) AND (("EndDate" < t."SlotDateTime") OR ("EndDate" IS NULL)) AND (($5 = 0) OR ("SpecialityId" = $6)))
                                                                        Rows Removed by Filter: 0
                                                                        Buffers: shared hit=286593
                                                ->  Index Scan using users_pk on "Users" u  (cost=0.29..8.32 rows=1 width=5) (actual time=0.007..0.007 rows=1 loops=147)
                                                      Index Cond: ("UserId" = t."DoctorId")
                                                      Filter: ("IsPublished" AND ((NOT $19) OR is_foreign_doctor) AND ($20 OR (lower(("FullName")::text) = ($21)::text) OR (is_foreign_doctor IS NULL) OR (NOT is_foreign_doctor)) AND (lower(("FullName")::text) ~~ concat('%', $22, '%')))
                                                      Rows Removed by Filter: 0
                                                      Buffers: shared hit=441
                                          ->  Bitmap Heap Scan on "UserSpecialty" usp  (cost=4.29..9.52 rows=2 width=8) (actual time=0.004..0.006 rows=3 loops=139)
                                                Recheck Cond: ("UserId" = u."UserId")
                                                Heap Blocks: exact=278
                                                Buffers: shared hit=556
                                                ->  Bitmap Index Scan on us1  (cost=0.00..4.29 rows=2 width=0) (actual time=0.003..0.003 rows=3 loops=139)
                                                      Index Cond: ("UserId" = u."UserId")
                                                      Buffers: shared hit=278
                                    ->  Seq Scan on company_users_join cuj  (cost=0.00..7.09 rows=183 width=21) (actual time=0.001..0.035 rows=229 loops=394)
                                          Filter: ((status = ANY ('{1,3}'::integer[])) AND (company_user_role = 2))
                                          Rows Removed by Filter: 4
                                          Buffers: shared hit=1182
                              ->  Seq Scan on company_users_join_providers cujp  (cost=0.00..1.87 rows=67 width=12) (actual time=0.001..0.006 rows=67 loops=391)
                                    Buffers: shared hit=391
                        ->  Materialize  (cost=1265.89..1265.96 rows=2 width=20) (actual time=112.444..122.087 rows=419 loops=1)
                              Buffers: shared hit=92316
                              ->  GroupAggregate  (cost=1265.89..1265.93 rows=2 width=20) (actual time=112.439..121.995 rows=133 loops=1)
                                    Group Key: t1."DoctorId", dse_1.company_user_join_id
                                    Buffers: shared hit=92316
                                    ->  Sort  (cost=1265.89..1265.90 rows=2 width=20) (actual time=112.206..115.889 rows=45298 loops=1)
                                          Sort Key: t1."DoctorId", dse_1.company_user_join_id
                                          Sort Method: quicksort  Memory: 5017kB
                                          Buffers: shared hit=92316
                                          ->  Nested Loop  (cost=10.24..1265.88 rows=2 width=20) (actual time=3.520..99.380 rows=45298 loops=1)
                                                Buffers: shared hit=92316
                                                ->  Bitmap Heap Scan on "TimeSlots" t1  (cost=9.95..1248.62 rows=4 width=16) (actual time=3.496..30.505 rows=45300 loops=1)
                                                      Recheck Cond: (("SlotDateTime" > $7) AND ("SlotDateTime" < $8))
                                                      Filter: (("DeleteDate" IS NULL) AND "IsActive" AND (($9 = 0) OR ("SpecialityId" = $10)) AND (($11 = 0) OR ("ServiceTypeId" = $12)))
                                                      Rows Removed by Filter: 50620
                                                      Heap Blocks: exact=1627
                                                      Buffers: shared hit=1715
                                                      ->  Bitmap Index Scan on timeslots__index_slotdatetime  (cost=0.00..9.95 rows=566 width=0) (actual time=3.284..3.284 rows=95920 loops=1)
                                                            Index Cond: (("SlotDateTime" > $7) AND ("SlotDateTime" < $8))
                                                            Buffers: shared hit=88
                                                ->  Index Only Scan using doctorschedule__index_sch5 on "DoctorSchedule" dse_1  (cost=0.28..4.30 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=45300)
                                                      Index Cond: (("DoctorScheduleId" = t1."DoctorScheduleId") AND ("DeleteDate" IS NULL))
                                                      Filter: (("EndDate" < t1."SlotDateTime") OR ("EndDate" IS NULL))
                                                      Heap Fetches: 0
                                                      Buffers: shared hit=90601

Again, please remember both of those plans were generated in a matter of seconds using a loop in the beckend and all the parameters were same. And also second query plan used to be 12000 to 25000ms until i diceded to add ts.SlotsCount >=0 at end of the select statement.

As im backend developer and im kinda new to postgres and i realize there is a myriad ways to solve the problem from backend but i just dont understand how can exact the same query generate different query plan. And it switches to different query plan almost at 8th execution. Which is really weird. And i want to know whats happening a lot.

Best Answer

The problem is that in your case, PostgreSQL switches to a generic plan after the first 5 executions, and the generic plan performs worse. In particular, the estimate on the index scan on slotdatetime is terribly off in the generic plan.

PostgreSQL has a built-in heuristics form statements whose plans can be cached: it plans them with the actual parameter values for the first 5 times, and then it switches to a generic plan if that plan promises not to perform worse. Unfortunately the generic plan doesn't keep the promise in this case.

Since there is no good way to fix the generic plan, the best you can do is avoid it. You should upgrade to PostgreSQL v12 or better and set the parameter plan_cache_mode to force_custom_plan when you run this query. If the query is executed in a PL/pgSQL function, you can simply

ALTER FUNCTION myfunc(argtype1,argtype2) SET plan_cache_mode = force_custom_plan;

An alternative is to use dynamic SQL with EXECUTE, that will always use a custom plan.