PostgreSQL bad plan: the frequency of a tag depends on the project!

I try to simplify a problem that I've been investigating for weeks.

Suppose that you have a large table like this:

SUBSCRIBERS
----------------------------
id | project_id | tags | ...
----------------------------
1  | 1          | ['en', 'something']
2  | 1          | ['en']
3  | 1          | ['de']
4  | 2          | []
5  | 2          [ []
...

Now suppose that project 1 has millions of rows and most of them have the tag en.

Also project 2 has millions of rows, but there isn't any row with the tag en.

The following query generates terrible execution plans:

SELECT * FROM subscriptions WHERE project_id = 2 AND tags @> ARRAY['en']::varchar[];

Indeed the planner sees the following:

project = 2 is very common
tag = 'en' is very common

Thus the planner thinks that there are many rows to return! However no rows match the actual condition.

That creates crazy choices in the indexes or even seq scans! I mean, I verified that it takes 20 minutes instead of a few ms, due to the wrong choices!

I have already tried the following (without any benefit):

SET STATISTICS 10000 for project_id and tags
CREATE STATISTICS on project_id, tags
SET random_page_cost = 1
make sure that the perfect indexes (compound, btree, gin, etc.) exist on project_id and tags
frequent ANALYZE and much more…

I think that the problem is that PostgreSQL should collect separate stats on tags for each project id. However it seems that there is no way to achieve that… Any idea?

Also I think that CREATE STATISTICS doesn't work properly on array columns (maybe works on the whole column instead of the single values). Any specific information about this?

Best Answer

You are correct that there is no way to do this, and that CREATE STATISTICS only looks at arrays in their entirety, not their individual components.

Perhaps CREATE STATISTICS can be expanded in the future to do that, but someone would have to volunteer to implement it, and I am not aware of anyone working on such a thing.

Note that even if "tag" were a scalar, CREATE STATISTICS wouldn't help you in versions 10 or 11, as the mcv type of statistics were not added until v12, and that is what you need. So presumably what you would need is a new type of statistics, call it arrray_mcv.

Best Answer

Related Solutions

Postgresql – How to handle this complicated join condition(s)

Postgresql – How to influence the planner on Postgresql

Related Question