PostgreSQL Performance – Manually Setting most_common_vals

Tags: performance, postgresql, postgresql-performance

I am currently building a system for exploration purposes on top of Postgres. Essentially, this results in queries against a table of (key, value) pairs of the following form, representing a graph model:

edge_id | term_id
--------+--------
   int  |   int

(Note that I am not working with "traditional" graphs, but rather with a hypergraph model, hence the implicit storage of edge participation. This question is not about data modeling, but about how to speed up queries against such a model.)

Fortunately, the default parameters already yield some decent runtimes, but to maximize performance I am currently investigating methods to optimize the query planner.

Since I am dealing with a Zipfian distribution and the table holds several hundred million rows, I don't expect the query planner to accurately "guess" (or sample) this distribution. I have already confirmed this suspicion by inspecting the values stored in pg_stats (and even in pg_statistic): the result is a more or less uniform distribution with a few spikes, but nothing particularly accurate, even with the maximum statistics target of 10000.
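
For reference, this is roughly how I inspect the stored statistics (table and column names are those from the example further down):

-- Show the sampled distinct-count and most-common-value statistics for term_id.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'entity_hyperedges'
  AND attname = 'term_id';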

Since my data changes only very infrequently (less than once per month), I would assume that explicitly setting most_common_freqs and most_common_vals would be a good idea, but I have no clue how to do that. This would simply be the equivalent of increasing the statistics sample to the full table, and the queries needed to compute the exact distribution do not take very long compared to the potential speed-up.
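
To illustrate: computing the exact top of the distribution over the full table is cheap enough at my update frequency. A sketch (the LIMIT of 100 is an arbitrary cut-off for illustration):

-- Exact per-term frequencies over the full table, most common first.
-- This is essentially the information I would like to hand to the planner.
SELECT term_id,
       COUNT(*)::float8 / SUM(COUNT(*)) OVER () AS freq
FROM entity_hyperedges
GROUP BY term_id
ORDER BY COUNT(*) DESC
LIMIT 100;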

On that note, I would also be interested in how Postgres behaves on a repeated ANALYZE; my concern is that manually set statistics would simply be overwritten, since I cannot set them as an external parameter (compare n_distinct, which I can set via ALTER TABLE <name> ALTER COLUMN <colname> SET (n_distinct = <value>);). Instead, I would apparently have to UPDATE pg_statistic directly.
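
For comparison, this is the kind of per-column override that does survive a re-ANALYZE; the value here is just a placeholder, not my real distinct count:

-- Pin n_distinct as a column option; the next ANALYZE uses the pinned value
-- instead of its own estimate. (350000 is a placeholder value.)
ALTER TABLE entity_hyperedges
  ALTER COLUMN term_id SET (n_distinct = 350000);

ANALYZE entity_hyperedges;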


EDIT:
In accordance with the comments, I want to provide some more details:
* There are two indexes defined on the table mentioned above: the primary-key index (_pkey) on (edge_id, term_id), plus an additional btree index on term_id only (a DDL sketch follows this list).
* The main query type I am trying to optimize for (right now) is finding the terms that most frequently co-occur with a given term.
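
Reconstructed from the above, the DDL looks roughly like this (a sketch; column types as in the table at the top, index names as they appear in the query plan below):

-- Sketch of the table and indexes described above.
CREATE TABLE entity_hyperedges (
    edge_id int NOT NULL,
    term_id int NOT NULL,
    PRIMARY KEY (edge_id, term_id)   -- entity_hyperedges_pkey
);

CREATE INDEX entity_hyperedges_term_id ON entity_hyperedges (term_id);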

To give an example:

SELECT eh1.term_id, COUNT(*) as freq 
FROM entity_hyperedges eh1, entity_hyperedges eh2 
WHERE eh1.edge_id = eh2.edge_id 
  AND eh1.term_id != eh2.term_id 
  AND eh2.term_id = 289759 
GROUP BY eh1.term_id 
ORDER BY freq DESC;

where entity_hyperedges has the structure mentioned above. The corresponding query plan has already improved significantly, but the row estimates are still off by orders of magnitude in places; note that this is just a random example.

    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=224317.46..224622.84 rows=122153 width=12) (actual time=928.561..929.433 rows=19921 loops=1)
   Sort Key: (count(*)) DESC
   Sort Method: quicksort  Memory: 1702kB
   ->  Finalize HashAggregate  (cost=212775.02..213996.55 rows=122153 width=12) (actual time=922.386..925.706 rows=19921 loops=1)
         Group Key: eh1.term_id
         ->  Gather  (cost=185901.36..211553.49 rows=244306 width=12) (actual time=905.030..914.302 rows=34652 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial HashAggregate  (cost=184901.36..186122.89 rows=122153 width=12) (actual time=901.920..904.390 rows=11551 loops=3)
                     Group Key: eh1.term_id
                     ->  Hash Join  (cost=66229.84..180935.95 rows=793082 width=4) (actual time=151.684..829.044 rows=528155 loops=3)
                           Hash Cond: (eh1.edge_id = eh2.edge_id)
                           Join Filter: (eh1.term_id <> eh2.term_id)
                           Rows Removed by Join Filter: 119566
                           ->  Parallel Seq Scan on entity_hyperedges eh1  (cost=0.00..101752.40 rows=4934740 width=8) (actual time=0.014..221.310 rows=3947792 loops=3)
                           ->  Hash  (cost=61710.60..61710.60 rows=361539 width=8) (actual time=150.257..150.257 rows=358699 loops=3)
                                 Buckets: 524288  Batches: 1  Memory Usage: 18108kB
                                 ->  Bitmap Heap Scan on entity_hyperedges eh2  (cost=4786.36..61710.60 rows=361539 width=8) (actual time=18.053..92.027 rows=358699 loops=3)
                                       Recheck Cond: (term_id = 289759)
                                       Heap Blocks: exact=24760
                                       ->  Bitmap Index Scan on entity_hyperedges_term_id  (cost=0.00..4695.98 rows=361539 width=0) (actual time=15.017..15.017 rows=358699 loops=3)
                                             Index Cond: (term_id = 289759)
 Planning time: 14.916 ms
 Execution time: 934.802 ms
(24 rows)

Best Answer

The column pg_stats.most_common_vals is of type 'anyarray', so that it can accommodate stats on any underlying column type. However, 'anyarray' is implemented with some internal magic, and I don't think there is any way to store into it from user space. So what you want cannot be done.

However, I don't think it would help you anyway, at least if your example is well chosen. Where your estimate goes far wrong is when it tries to estimate how many distinct values of eh1.term_id join to a specific value of eh2.term_id, and I think nothing in PostgreSQL's current stats system can help you with that. I believe it is estimating how many distinct values of eh1.term_id are likely to be found in 793,082 randomly chosen rows of eh1. It cannot accommodate the notion that those 793,082 rows are not actually randomly chosen. And even if it somehow knew there were actually 528,155 rows rather than 793,082, it still wouldn't have a good way of knowing how many distinct values of eh1.term_id they would contain.
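
To make that concrete, the quantity the planner would need is roughly the following, which per-column statistics cannot predict (using the example term from the question):

-- Number of distinct eh1.term_id values that actually join to the chosen term.
SELECT COUNT(DISTINCT eh1.term_id)
FROM entity_hyperedges eh1
JOIN entity_hyperedges eh2 USING (edge_id)
WHERE eh2.term_id = 289759
  AND eh1.term_id <> eh2.term_id;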

Also, my intuition says that a better estimate here is unlikely to lead to a better plan. What would the planner do differently if it had a better estimate?