PostgreSQL: ANY (VALUES(…)) in WHERE clause causes drastic slowdown

Tags: performance, postgresql, postgresql-9.5, query-performance

Okay, I previously asked a question about a large dataset that was never answered, so I have cut it down to a smaller subset of the original setup and simplified what I am trying to accomplish in this new question, hoping it will be clearer.

I have a single large table (report_drugs) that is 1775 MB on disk and contains a little over 33 million rows. The table layout:

    Column     |            Type             | Modifiers 
---------------+-----------------------------+-----------
 rid           | integer                     | not null
 drug          | integer                     | not null
 created       | timestamp without time zone | 
 reason        | text                        | 
 duration      | integer                     | 
 drugseq       | integer                     | 
 effectiveness | integer                     | 
Indexes:
  "report_drugs_drug_idx" btree (drug) CLUSTER
  "report_drugs_drug_rid_idx" btree (drug, rid)
  "report_drugs_reason_idx" btree (reason)
  "report_drugs_reason_rid_idx" btree (reason, rid)
  "report_drugs_rid_idx" btree (rid)

As you can see, I have several indexes (not all relevant to this question) and have CLUSTERed the table on the drug column's index, as this is the column most queries filter on. The table is VACUUM ANALYZEd automatically, and also manually by me before gathering any metrics.
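For reference, the maintenance I run before measuring is roughly the following (a sketch of the statements as I use them; the index name matches the layout above):

CLUSTER report_drugs USING report_drugs_drug_idx;  -- rewrite the table in index order
VACUUM ANALYZE report_drugs;                       -- reclaim dead tuples and refresh planner statistics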

Yet a simple query like this:

SELECT drug, reason FROM report_drugs WHERE drug = ANY(VALUES  (9557), (17848),
(17880), (18223), (18550), (19020), (19084), (19234), (21295), (21742), 
(23085), (26017), (27016), (29317), (33566), (35818), (37394), (39971), 
(41505), (42162), (44000), (45168), (47386), (48848), (51472), (51570), 
(51802), (52489), (52848), (53663), (54591), (55506), (55922), (57209), 
(57671), (59311), (62022), (62532), (63485), (64134), (66236), (67394), 
(67586), (68134), (68934), (70035), (70589), (70896), (73466), (75931), 
(78686), (78985), (79217), (83294), (83619), (84964), (85831), (88330), 
(89998), (90440), (91171), (91698), (91886), (91887), (93219), (93766), 
(94009), (96341), (101475), (104623), (104973), (105216), (105496), 
(106428), (110412), (119567), (121154));

takes well over 7 seconds to complete and produces the following query plan:

Nested Loop  (cost=1.72..83532.00 rows=24164 width=26) (actual time=0.947..7385.490 rows=264610 loops=1)
->  HashAggregate  (cost=1.16..1.93 rows=77 width=4) (actual time=0.017..0.036 rows=77 loops=1)
     Group Key: "*VALUES*".column1
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.96 rows=77 width=4) (actual time=0.001..0.007 rows=77 loops=1)
->  Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.56..1081.67 rows=314 width=26) (actual time=0.239..95.568 rows=3436 loops=77)
     Index Cond: (drug = "*VALUES*".column1)
Planning time: 7.009 ms
Execution time: 7393.408 ms

The more values I add to the ANY(VALUES(..)) clause, the slower it gets. The query can sometimes contain over 200 values, at which point it takes well over 30 seconds to complete. Yet with only a few values (4, for example) it finishes in under 200 ms. So it is clearly this part of the WHERE clause that causes the slowdown.

What might I do to make this query perform better?
What obvious points am I missing here?

My hardware and database settings:

I'm running the cluster from an SSD. The system has 24 GB of memory in total, runs Debian, and uses a 4 GHz 8-core i7-4790 processor. That should be sufficient hardware for this kind of dataset.

Some important postgresql.conf readouts:

  • shared_buffers = 4GB
  • work_mem = 100MB
  • checkpoint_completion_target = 0.9
  • autovacuum = on

A side question to this:

Previously I had used WHERE drug = ANY(ARRAY[..]), but I found that using WHERE drug = ANY(VALUES(..)) gives me a significant speed boost. Why should that make a difference?


Edit 1 – JOIN on VALUES instead of WHERE clause

As a_horse_with_no_name pointed out in the comments, I tried removing the WHERE clause and performing the query with a JOIN against the drug values:

Query:

SELECT drug, reason FROM report_drugs d JOIN (VALUES  (9557), (17848),
(17880), (18223), (18550), (19020), (19084), (19234), (21295), (21742), 
(23085), (26017), (27016), (29317), (33566), (35818), (37394), (39971), 
(41505), (42162), (44000), (45168), (47386), (48848), (51472), (51570), 
(51802), (52489), (52848), (53663), (54591), (55506), (55922), (57209), 
(57671), (59311), (62022), (62532), (63485), (64134), (66236), (67394), 
(67586), (68134), (68934), (70035), (70589), (70896), (73466), (75931), 
(78686), (78985), (79217), (83294), (83619), (84964), (85831), (88330), 
(89998), (90440), (91171), (91698), (91886), (91887), (93219), (93766), 
(94009), (96341), (101475), (104623), (104973), (105216), (105496), 
(106428), (110412), (119567), (121154)) as x(d) on x.d = d.drug;

Plan (with analyze and buffers as requested by jjanes):

Nested Loop  (cost=0.56..83531.04 rows=24164 width=26) (actual time=1.003..6927.080 rows=264610 loops=1)
  Buffers: shared hit=12514 read=111251
  ->  Values Scan on "*VALUES*"  (cost=0.00..0.96 rows=77 width=4) (actual time=0.000..0.059 rows=77 loops=1)
  ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.56..1081.67 rows=314 width=26) (actual time=0.217..89.551 rows=3436 loops=77)
        Index Cond: (drug = "*VALUES*".column1)
        Buffers: shared hit=12514 read=111251
Planning time: 7.616 ms
Execution time: 6936.466 ms

However, this has no real effect. While the query plan changed slightly, the execution time is roughly the same and the query is still slow.


Edit 2 – JOIN on temporary table instead of JOIN on VALUES

Following Lennart's advice, I created a temporary table within a single transaction, filled it with the drug values, and joined against that. Although I gained ~2 seconds, the query is still very slow at a little over 5 seconds.
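The approach looked roughly like this (a sketch; the table name t matches the plan below, and ON COMMIT DROP scopes the table to the transaction). Note the ANALYZE step: the plan below estimates rows=2550 for t, which is the planner's default when a temp table has no statistics.

BEGIN;

-- temporary table holding the drug ids to filter on
CREATE TEMPORARY TABLE t (drug integer) ON COMMIT DROP;

INSERT INTO t (drug) VALUES (9557), (17848), (17880);  -- … and the rest

-- give the planner real row counts instead of the default estimate
ANALYZE t;

SELECT d.drug, d.reason
FROM report_drugs d
JOIN t ON t.drug = d.drug;

COMMIT;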

The query plan has changed from a nested loop to a hash join, and it now does a sequential scan on the report_drugs table. Could this be a missing index somehow (the drug column in report_drugs does have an index…)?

Hash Join  (cost=67.38..693627.71 rows=800224 width=26) (actual time=0.711..4999.222 rows=264610 loops=1)
  Hash Cond: (d.drug = t.drug)
  ->  Seq Scan on report_drugs d  (cost=0.00..560537.16 rows=33338916 width=26) (actual time=0.410..3144.117 rows=33338915 loops=1)
  ->  Hash  (cost=35.50..35.50 rows=2550 width=4) (actual time=0.012..0.012 rows=77 loops=1)
      Buckets: 4096  Batches: 1  Memory Usage: 35kB
      ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.002..0.005 rows=77 loops=1)
Planning time: 7.030 ms
Execution time: 5005.621 ms

Best Answer

Since you're just comparing against a list of values, using IN would be simpler in this case:

SELECT drug, reason FROM report_drugs WHERE drug IN (9557,17848,17880,18223,18550);

Alternatively, if you stick with ANY, using an array literal produces the same query plan as IN for me:

SELECT drug, reason FROM report_drugs WHERE drug = ANY ('{9557,17848,17880,18223,18550}');

I tried this in a smaller test table, and Postgres was able to do an index scan for the versions of the query using IN and ANY with an array literal, but not for the one using ANY with VALUES.

The resulting plan is something like the following (though my test table and data are somewhat different):

Index Scan using test_data_id_idx on test_data  (cost=0.43..57.43 rows=12 width=8) (actual time=0.014..0.028 rows=12 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12}'::integer[]))

This should be much faster than the plan you showed, as it scans the index once, whereas your plan loops once for every drug in the WHERE clause (77 index scans in your case).
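A usage note: if the list comes from application code, the array form also lets you bind the whole list as a single parameter, so the query text stays the same no matter how many ids you pass. A sketch using PREPARE/EXECUTE to illustrate (in practice your driver would bind the array directly):

-- one prepared statement, one array parameter, any number of ids
PREPARE drugs_by_id (int[]) AS
  SELECT drug, reason FROM report_drugs WHERE drug = ANY ($1);

EXECUTE drugs_by_id ('{9557,17848,17880}');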