PostgreSQL – Why Does an Extra WHERE Clause Make an Index-Only Scan Slow?

index-tuning | postgresql | postgresql-9.5

I have a query that runs 18x slower because of a single WHERE clause. The expensive part of the query plan is an index-only scan, and the clause in question filters on one of the fields covered by that index.

This is the query; the offending clause is update_id < 40581:

EXPLAIN (VERBOSE, ANALYZE, COSTS, BUFFERS)
WITH a AS (
  SELECT e_match.account_id, for_time, for_eid,
         originating_update_id, attr1, attr2, attr3
  FROM combinations combo
  INNER JOIN (SELECT account_id, time as for_time, eid AS for_eid,
              originating_update_id
              FROM entries
              WHERE account_id = '1'
                    AND time <= '2017-11-30 15:57:35.267405+00' AND
                    originating_update_id < 40581) e_match
  ON combo.account_id = e_match.account_id)
SELECT a.attr1, a.attr2, a.attr3, b.amount, for_time, for_eid,
        originating_update_id, b.eid = a.for_eid AS is_stored
FROM a
CROSS JOIN LATERAL (SELECT time, update_id, amount, eid
                    FROM updates
                    WHERE account_id = a.account_id AND
                          attr1 = a.attr1 AND
                          attr2 = a.attr2 AND
                          attr3 = a.attr3 AND
                          update_id < 40581 AND -- THIS CLAUSE MAKES IT SLOW
                          time <= a.for_time AND
                          (time < a.for_time OR
                           (time = a.for_time AND
                            originating_update_id <= a.originating_update_id))
                    ORDER BY time DESC,
                             originating_update_id DESC,
                             update_id DESC
                    LIMIT 1) b
ORDER BY for_time, originating_update_id, for_eid, attr1, attr2, attr3;

It takes 14241.977 ms to run, but if I remove the update_id < 40581 clause it runs in 797.601 ms.

The query is relatively complex. It joins a table of entries against a table of all updates for those entries, in order to pick the latest entry for each known combination of attributes. This is the loose-index-scan emulation trick. According to the query plan, most of the time is spent in the SELECT on the updates table, and it is that time that differs between the two versions.
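For reference, the core of that pattern looks like this on a stripped-down schema (the table and column names here are hypothetical, not the real ones):

-- Loose-index-scan emulation: for each known key, fetch its latest row
-- via a correlated LIMIT 1 subquery. Each loop iteration can descend a
-- btree on (key, time DESC) directly instead of scanning all of that
-- key's rows.
SELECT k.key, u.time, u.amount
FROM distinct_keys k                 -- one row per key, like "combinations"
CROSS JOIN LATERAL (
  SELECT time, amount
  FROM events u
  WHERE u.key = k.key                -- equality on the index prefix
  ORDER BY u.time DESC               -- matches the index sort order
  LIMIT 1                            -- stop at the first (latest) entry
) u;

The real query is the same shape, with a four-column key and a tie-breaking filter on (time, originating_update_id).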

The tables are structured as follows:

                          Table "public.updates"
        Column         |            Type             | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
 account_id            | text                        |           | not null |
 attr1                 | text                        |           | not null |
 eid                   | bigint                      |           | not null |
 as_of_eid             | bigint                      |           | not null |
 update_id             | bigint                      |           | not null |
 originating_update_id | bigint                      |           | not null |
 time                  | timestamp without time zone |           | not null |
 attr2                 | integer                     |           | not null |
 attr3                 | text                        |           | not null |
 amount                | numeric                     |           | not null |
Indexes:
    "updates_pkey" PRIMARY KEY, btree (eid, as_of_eid, account_id, attr1)
    "updates_update_id_eid_account_id_attr1_key" UNIQUE CONSTRAINT, btree (update_id, eid, account_id, attr1)
    "updates_covering_idx2" btree (account_id, attr1, attr2, attr3, time DESC, update_id DESC, originating_update_id DESC, amount, eid)
    "updates_covering_idx3" btree (account_id, attr1, attr2, attr3, time DESC, originating_update_id DESC, update_id DESC, amount, eid)

                     Table "public.entries"
        Column         |            Type             | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
 account_id            | text                        |           | not null |
 eid                   | bigint                      |           | not null |
 time                  | timestamp without time zone |           |          |
 originating_update_id | bigint                      |           |          |
Indexes:
    "entries2_pkey" PRIMARY KEY, btree (account_id, eid)
    "entries_by_time" btree (account_id, time, eid, originating_update_id)

    Table "public.combinations"
    Column    |  Type   | Collation | Nullable | Default
--------------+---------+-----------+----------+---------
 account_id   | text    |           | not null |
 attr1        | text    |           | not null |
 attr2        | integer |           | not null |
 attr3        | text    |           | not null |
Indexes:
    "combinations_pkey" PRIMARY KEY, btree (account_id, attr1, attr2, attr3)

This is the query plan from the slow run with the offending clause:

 Sort  (cost=65859.13..65983.86 rows=49889 width=107) (actual time=14237.002..14239.136 rows=12967 loops=1)
   Output: a.attr1, a.attr2, a.attr3, updates.amount, a.for_time, a.for_eid, a.originating_update_id, ((updates.eid = a.for_eid))
   Sort Key: a.for_time, a.originating_update_id, a.for_eid, a.attr1, a.attr2, a.attr3
   Sort Method: quicksort  Memory: 2208kB
   Buffers: shared hit=849601
   CTE a
     ->  Nested Loop  (cost=0.83..1279.76 rows=49889 width=71) (actual time=0.035..16.200 rows=18102 loops=1)
           Output: entries.account_id, entries.time, entries.eid, entries.originating_update_id, combo.attr1, combo.attr2, combo.attr3
           Buffers: shared hit=69
           ->  Index Only Scan using entries_by_time on public.entries  (cost=0.42..651.60 rows=7127 width=57) (actual time=0.021..1.895 rows=2586 loops=1)
                 Output: entries.account_id, entries.time, entries.eid, entries.originating_update_id
                 Index Cond: ((entries.account_id = '1'::text) AND (entries.time <= '2017-11-30 15:57:35.267405'::timestamp without time zone) AND (entries.originating_update_id < 40581))
                 Heap Fetches: 0
                 Buffers: shared hit=65
           ->  Materialize  (cost=0.41..4.57 rows=7 width=50) (actual time=0.000..0.002 rows=7 loops=2586)
                 Output: combo.attr1, combo.attr2, combo.attr3, combo.account_id
                 Buffers: shared hit=4
                 ->  Index Only Scan using combinations_pkey on public.combinations combo  (cost=0.41..4.53 rows=7 width=50) (actual time=0.010..0.012 rows=7 loops=1)
                       Output: combo.attr1, combo.attr2, combo.attr3, combo.account_id
                       Index Cond: (combo.account_id = '1'::text)
                       Heap Fetches: 0
                       Buffers: shared hit=4
   ->  Nested Loop  (cost=0.69..57784.93 rows=49889 width=107) (actual time=0.117..14217.115 rows=12967 loops=1)
         Output: a.attr1, a.attr2, a.attr3, updates.amount, a.for_time, a.for_eid, a.originating_update_id, (updates.eid = a.for_eid)
         Buffers: shared hit=849601
         ->  CTE Scan on a  (cost=0.00..997.78 rows=49889 width=124) (actual time=0.037..30.353 rows=18102 loops=1)
               Output: a.account_id, a.for_time, a.for_eid, a.originating_update_id, a.attr1, a.attr2, a.attr3
               Buffers: shared hit=69
         ->  Limit  (cost=0.69..1.12 rows=1 width=39) (actual time=0.783..0.783 rows=1 loops=18102)
               Output: updates.time, updates.update_id, updates.amount, updates.eid, updates.originating_update_id
               Buffers: shared hit=849532
               ->  Index Only Scan using updates_covering_idx3 on public.updates  (cost=0.69..777.72 rows=1825 width=39) (actual time=0.782..0.782 rows=1 loops=18102)
                     Output: updates.time, updates.update_id, updates.amount, updates.eid, updates.originating_update_id
                     Index Cond: ((updates.account_id = a.account_id) AND (updates.attr1 = a.attr1) AND (updates.attr2 = a.attr2) AND (updates.attr3 = a.attr3) AND (updates.time <= a.for_time) AND (updates.update_id < 40581))
                     Filter: ((updates.time < a.for_time) OR ((updates.time = a.for_time) AND (updates.originating_update_id <= a.originating_update_id)))
                     Rows Removed by Filter: 6
                     Heap Fetches: 0
                     Buffers: shared hit=849532
 Planning time: 0.446 ms
 Execution time: 14241.977 ms
(40 rows)

and the fast run without the clause:

 Sort  (cost=58914.11..59038.83 rows=49889 width=107) (actual time=792.649..794.853 rows=12970 loops=1)
   Output: a.attr1, a.attr2, a.attr3, updates.amount, a.for_time, a.for_eid, a.originating_update_id, ((updates.eid = a.for_eid))
   Sort Key: a.for_time, a.originating_update_id, a.for_eid, a.attr1, a.attr2, a.attr3
   Sort Method: quicksort  Memory: 2208kB
   Buffers: shared hit=175954
   CTE a
     ->  Nested Loop  (cost=0.83..1279.76 rows=49889 width=71) (actual time=0.036..15.278 rows=18102 loops=1)
           Output: entries.account_id, entries.time, entries.eid, entries.originating_update_id, combo.attr1, combo.attr2, combo.attr3
           Buffers: shared hit=69
           ->  Index Only Scan using entries_by_time on public.entries  (cost=0.42..651.60 rows=7127 width=57) (actual time=0.022..1.675 rows=2586 loops=1)
                 Output: entries.account_id, entries.time, entries.eid, entries.originating_update_id
                 Index Cond: ((entries.account_id = '1'::text) AND (entries.time <= '2017-11-30 15:57:35.267405'::timestamp without time zone) AND (entries.originating_update_id < 40581))
                 Heap Fetches: 0
                 Buffers: shared hit=65
           ->  Materialize  (cost=0.41..4.57 rows=7 width=50) (actual time=0.000..0.002 rows=7 loops=2586)
                 Output: combo.attr1, combo.attr2, combo.attr3, combo.account_id
                 Buffers: shared hit=4
                 ->  Index Only Scan using combinations_pkey on public.combinations combo  (cost=0.41..4.53 rows=7 width=50) (actual time=0.011..0.014 rows=7 loops=1)
                       Output: combo.attr1, combo.attr2, combo.attr3, combo.account_id
                       Index Cond: (combo.account_id = '1'::text)
                       Heap Fetches: 0
                       Buffers: shared hit=4
   ->  Nested Loop  (cost=0.69..50839.90 rows=49889 width=107) (actual time=0.115..774.342 rows=12970 loops=1)
         Output: a.attr1, a.attr2, a.attr3, updates.amount, a.for_time, a.for_eid, a.originating_update_id, (updates.eid = a.for_eid)
         Buffers: shared hit=175954
         ->  CTE Scan on a  (cost=0.00..997.78 rows=49889 width=124) (actual time=0.039..28.176 rows=18102 loops=1)
               Output: a.account_id, a.for_time, a.for_eid, a.originating_update_id, a.attr1, a.attr2, a.attr3
               Buffers: shared hit=69
         ->  Limit  (cost=0.69..0.98 rows=1 width=39) (actual time=0.040..0.040 rows=1 loops=18102)
               Output: updates.time, updates.update_id, updates.amount, updates.eid, updates.originating_update_id
               Buffers: shared hit=175885
               ->  Index Only Scan using updates_covering_idx3 on public.updates  (cost=0.69..808.79 rows=2820 width=39) (actual time=0.039..0.039 rows=1 loops=18102)
                     Output: updates.time, updates.update_id, updates.amount, updates.eid, updates.originating_update_id
                     Index Cond: ((updates.account_id = a.account_id) AND (updates.attr1 = a.attr1) AND (updates.attr2 = a.attr2) AND (updates.attr3 = a.attr3) AND (updates.time <= a.for_time))
                     Filter: ((updates.time < a.for_time) OR ((updates.time = a.for_time) AND (updates.originating_update_id <= a.originating_update_id)))
                     Rows Removed by Filter: 47
                     Heap Fetches: 0
                     Buffers: shared hit=175885
 Planning time: 0.367 ms
 Execution time: 797.601 ms
(40 rows)

There are 65788426 rows in the updates table, which has been recently vacuumed and analyzed. The database settings are the Amazon RDS defaults. I've renamed the tables and fields in the output above; everything else is unabridged.

Why would that particular clause cause such a slowdown? The EXPLAIN ANALYZE output is almost identical between the two runs; the main difference is the shared buffer hit count.

Strangely, if I change the clause to one on some other field, e.g. originating_update_id < 100000000, the query doesn't slow down.

UPDATE

I've simplified the query as @jjanes suggested:

EXPLAIN (VERBOSE, ANALYZE, COSTS, BUFFERS)
SELECT time, update_id, amount, eid
FROM updates
WHERE account_id = '1' AND
      attr1 = '' AND
      attr2 = 3 AND
      attr3 = 'ABC' AND
      update_id < 40581 AND  -- THIS CLAUSE MAKES IT SLOW
      time <= '2017-12-30 15:57:35.267405+00' AND
      (time < '2017-12-30 15:57:35.267405+00' OR
       (time = '2017-12-30 15:57:35.267405+00' AND
        originating_update_id <= 10000000))
ORDER BY time DESC,
         originating_update_id DESC,
         update_id DESC
LIMIT 1;

With the offending clause it takes 35.5 ms, reproducibly. Without that clause it takes 0.073 ms.

Query plan with the clause:

 Limit  (cost=0.69..0.82 rows=1 width=39) (actual time=35.467..35.468 rows=1 loops=1)
   Output: time, update_id, amount, eid, originating_update_id
   Buffers: shared hit=1856
   ->  Index Only Scan using updates_covering_idx3 on public.updates  (cost=0.69..731.25 rows=5423 width=39) (actual time=35.465..35.465 rows=1 loops=1)
         Output: time, update_id, amount, eid, originating_update_id
         Index Cond: ((updates.account_id = '1'::text) AND (updates.attr1 = ''::text) AND (updates.attr2 = 3) AND (updates.attr3 = 'ABC'::text) AND (updates.time <= '2017-12-30 15:57:35.267405'::timestamp without time zone) AND (updates.update_id < 40581))
         Filter: ((updates.time < '2017-12-30 15:57:35.267405'::timestamp without time zone) OR ((updates.time = '2017-12-30 15:57:35.267405'::timestamp without time zone) AND (updates.originating_update_id <= 10000000)))
         Heap Fetches: 0
         Buffers: shared hit=1856
 Planning time: 0.202 ms
 Execution time: 35.493 ms
(11 rows)

and the plan without it:

 Limit  (cost=0.69..0.78 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=1)
   Output: time, update_id, amount, eid, originating_update_id
   Buffers: shared hit=6
   ->  Index Only Scan using updates_covering_idx3 on public.updates  (cost=0.69..762.10 rows=8383 width=39) (actual time=0.044..0.044 rows=1 loops=1)
         Output: time, update_id, amount, eid, originating_update_id
         Index Cond: ((updates.account_id = '1'::text) AND (updates.attr1 = ''::text) AND (updates.attr2 = 3) AND (updates.attr3 = 'ABC'::text) AND (updates.time <= '2017-12-30 15:57:35.267405'::timestamp without time zone))
         Filter: ((updates.time < '2017-12-30 15:57:35.267405'::timestamp without time zone) OR ((updates.time = '2017-12-30 15:57:35.267405'::timestamp without time zone) AND (updates.originating_update_id <= 10000000)))
         Heap Fetches: 0
         Buffers: shared hit=6
 Planning time: 0.201 ms
 Execution time: 0.073 ms
(11 rows)

Best Answer

This kind of makes sense if your data are distributed in a certain way. Without the update_id < 40581 condition, the very first index entry that matches account_id = '1' AND attr1 = '' AND attr2 = 3 AND attr3 = 'ABC' satisfies the LIMIT clause, because the index entries are already in the requested order. With the condition in place, the scan starts at the same spot, but update_id is not the leading sort column after the equality prefix: within the scanned range the entries are ordered by time DESC, then originating_update_id DESC, so update_id values are scattered throughout it. The btree cannot seek on that condition; it is merely checked against each index tuple, and the scan has to step through entries one by one until it finds a row with update_id < 40581, which may be way down the list. That is also why a condition like originating_update_id < 100000000 doesn't hurt: the first row in scan order happens to satisfy it already. The buffer counts bear this out: in the simplified query, the scan with the extra clause touches 1856 shared buffers to return a single row, versus 6 without it.
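One way to test this theory is to count how far down the list that first qualifying row actually sits. A diagnostic sketch (the predicate values are copied from the simplified query above; the inverted >= condition is the only new part, and it assumes update_id broadly rises with time):

-- Count the index entries with update_id >= 40581 inside the scanned
-- range; these are (roughly) the tuples the index-only scan must step
-- over, one by one, before finding the first qualifying row.
SELECT count(*)
FROM updates
WHERE account_id = '1'
  AND attr1 = ''
  AND attr2 = 3
  AND attr3 = 'ABC'
  AND time <= '2017-12-30 15:57:35.267405+00'
  AND update_id >= 40581;

A large count here, set against the 1856 buffers the slow plan reads, would confirm that the clause turns a single btree descent into a long walk along the index.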