PostgreSQL – “Filter:” instead of “Index Cond:” in production index usage

index-tuning postgresql postgresql-performance

An index was created to optimize fetching multiple external identifiers and updating the underlying data. The index is a simple btree over the fields. The external identifier needs to be a string because it depends on the source of the data.

create index reports_collector_existing_index 
on reports(provider_id, connection_id, id_on_site);

The query runs very slowly in production (PostgreSQL 13.2 on Amazon RDS), and the explain shows that the index isn't fully used: the identifier column in it is ignored, and the rows are instead checked against the table via Filter:, as shown below:

production=# explain analyse SELECT * FROM reports WHERE (("provider_id" = 38) AND ("connection_id" = 1175966) 
AND ("id_on_site" IN ('104992038676612727711', '109990966674014073837', '105453161196787392789', '102255741242869923419', '104173931474885108430', '106078528214550610325', '115918069366580690259', '101145134508987068276', '115359007991887431873', '115909079192172351351', '109019484929659145996', '114223295457618192193', '100701806344027988654', '116172834178410664645', '109271042536972421195', '112763372050251071013', '114228853909813896127', '100378649534685813241', '109853822973690938225'));

 Index Scan using reports_collector_existing_index on reports  (cost=0.70..2.78 rows=1 width=695) (actual time=1588.672..78750.175 rows=50 loops=1)
   Index Cond: ((provider_id = 38) AND (connection_id = 1175966))
   Filter: (id_on_site = ANY ('{104992038676612727711,109990966674014073837,105453161196787392789,102255741242869923419,104173931474885108430,106078528214550610325,115918069366580690259,101145134508987068276,11535900799188743187>
   Rows Removed by Filter: 3675
 Planning Time: 2052.299 ms
 Execution Time: 78750.268 ms
(6 rows)

The same database was dumped into a local environment (although excluding more than 1 TB of table data for the sake of development) and used to run the same query. There the explain shows proper index usage:

development=# <same query from above>
 Sort  (cost=210.14..210.15 rows=6 width=615) (actual time=0.207..0.209 rows=50 loops=1)
   Sort Key: id_on_site
   Sort Method: quicksort  Memory: 43kB
   ->  Index Scan using reports_collector_existing_index on reports  (cost=0.41..210.06 rows=6 width=615) (actual time=0.038..0.177 rows=50 loops=1)
         Index Cond: ((provider_id = 38) AND (connection_id = 1175966) AND (id_on_site = ANY ('{104992038676612727711,109990966674014073837,105453161196787392789,102255741242869923419,104173931474885108430,106078528214550610325,115>
 Planning Time: 0.637 ms
 Execution Time: 0.243 ms

What could be causing the production environment to not fully use the index?

I have run ANALYZE on the table and also created statistics on the columns involved. I have tried recreating the index, using ASC, NULLS LAST, and partial indexes. All result in the same slow Filter:. I have no idea why the sort node is only present on dev, though I would love for production to have that sort too. Filtering on the first two integer columns is very fast.

On production it goes through all records in the provider_id/connection_id scope; that is why 3675 rows are removed by the filter.

Best Answer

The planner weighs the options and chooses the one it thinks will be cheaper, based of course on the estimates it has at hand at the time.

With the ANY in the "Index Cond", it needs to re-descend the index once for each of the 19 members of your list. Each time it re-descends, it thinks it will land on a different index leaf page, costing one random_page_cost (a questionable assumption perhaps, but the alternative assumption would be just as questionable). This leads to a high estimated cost. It pays this cost even for the members that don't end up matching any index entry.
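As a rough sketch of that arithmetic, assuming the stock default random_page_cost of 4.0 (an RDS parameter group may set something else): 19 re-descents at one random page each is about 19 * 4.0 = 76 cost units before any heap access is counted. You can check what your planner actually uses:

SHOW random_page_cost;   -- stock default 4.0; often lowered to ~1.1 on SSD storage
SHOW seq_page_cost;      -- stock default 1.0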

When the ANY is in the Filter, it thinks it will have to scan all rows meeting the provider_id = 38 AND connection_id = 1175966 condition and apply the ANY to each one. But the thing is, we have no way of knowing how many rows it thinks that will be. The EXPLAIN reports how many rows it thinks will survive the ANY filter, but doesn't report how many it thinks will be tested and fail (it has that estimate under the hood somewhere, it just doesn't make it visible to us). EXPLAIN ANALYZE reports how many actually failed, but the planner can't plan retroactively based on that value, of course.

It looks to me like your production server thinks only one row will have to be tested against the filter, which explains why it has such a low cost estimate and gets chosen, whereas your dev server has a more accurate estimate of how many rows need to be tested. The way to test this hypothesis would be to run a slightly different query and see what row estimate it gives on each machine:

explain SELECT * FROM reports WHERE "provider_id" = 38 AND "connection_id" = 1175966;
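If production estimates far fewer rows for that condition than dev does, the hypothesis holds. You can also inspect the per-column statistics that estimate is derived from; a hedged sketch, assuming the standard pg_stats view and the column names from your index:

-- Per-column statistics the planner's estimate is built from
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'reports'
  AND attname IN ('provider_id', 'connection_id');

An n_distinct for connection_id that is badly off on production would skew the combined row estimate in exactly this way.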

There is another possibility that the executor doesn't implement, which means the planner does not consider it: it could scan the whole section of the index where "provider_id" = 38 AND "connection_id" = 1175966 and test the ANY against the id_on_site value stored in the index, rather than jumping to the table to test the id_on_site stored there. I think the reason it does not do this is that it does not trust the ANY/IN test not to throw errors, so it is not going to risk testing it in the index and possibly throwing an error when that tuple in the table might be obsolete anyway. No one would thank the PostgreSQL developers for making it throw errors based on the invisible detritus! Now in this case it should be safe, as it is just a combination of = and OR, both of which are safe, but I guess the system is not clever enough to know that.

So what can you do about it? One thing would be to fix the estimate of how many rows will match the provider_id = 38 AND connection_id = 1175966 condition on production (the first step of which would be finding out what that estimate is). Another would be to get an index-only scan, by adding more columns to the index or removing columns from the select list. If you can get an index-only scan, it will test the ANY condition against the index value rather than the table value, as long as the table page is marked all-visible. Another thing you could do is switch your RDS to a storage class which doesn't suck; it appears to be taking over 20 ms per random page read, which, as Erwin points out, is horrible.
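To make the first two suggestions concrete, here are hedged sketches: the statistics object name is invented, the statistics target of 1000 is an arbitrary example, and you mention having created statistics already, so the second block only helps if those were not extended statistics on this column pair.

-- Give the planner more detail on the skewed column
-- (1000 is an arbitrary example target):
ALTER TABLE reports ALTER COLUMN connection_id SET STATISTICS 1000;
ANALYZE reports;

-- If provider_id and connection_id are correlated, extended statistics
-- can fix the combined estimate (PostgreSQL 10+):
CREATE STATISTICS reports_prov_conn_stats (dependencies)
    ON provider_id, connection_id FROM reports;
ANALYZE reports;

For the index-only scan, the existing index already contains all three columns, so narrowing the select list may be enough (the full IN list is elided to two values here):

SELECT provider_id, connection_id, id_on_site
FROM reports
WHERE provider_id = 38 AND connection_id = 1175966
  AND id_on_site IN ('104992038676612727711', '109990966674014073837');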