PostgreSQL – Improving distinct-value estimates in Postgres

Tags: cardinality-estimates, distinct, postgresql

Full counts in Postgres can be slow, for reasons that are well understood and much discussed. So I've been using estimation techniques instead, where possible. For row counts, pg_stats seems fine; for views, extracting the estimate returned by EXPLAIN works okay.

https://www.cybertec-postgresql.com/en/count-made-fast/
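
Roughly, the EXPLAIN trick I'm using looks like this (a sketch, not production code; count_estimate is my own helper name):

-- Ask the planner for its row estimate instead of counting.
CREATE OR REPLACE FUNCTION count_estimate(query text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    plan json;
BEGIN
    -- EXPLAIN (FORMAT JSON) returns a one-element JSON array
    -- whose 'Plan' object carries the estimated row count.
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Usage:
SELECT count_estimate('SELECT * FROM assembly_prods');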

But what about distinct values? Here I've had a lot less luck. Sometimes the estimates are 100% correct; sometimes they're off by a factor of 2 or 20. Truncated tables, in particular, seem to have badly stale estimates.

I just ran this test, with the results below:

analyze assembly_prods; -- Run ANALYZE first to give pg_stats every chance.

select 'count(*) distinct' as method,
        count(*) as count
from (select distinct assembly_id 
      from assembly_prods) d 
union all
select 'n_distinct from pg_stats' as method,
        n_distinct as count
from pg_stats 
where tablename  = 'assembly_prods' and
      attname    = 'assembly_id';

The results:

method                      count
count(*) distinct           28088
n_distinct from pg_stats    13805

That's only off by a factor of 2, but I've seen much worse in my data, to the point where I won't use the estimates. Is there something else I can try? Is this something that PG 12 improves?

Follow-up

I hadn't ever experimented with SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz's answer, I've taken a quick look. Here's a useful passage from the documentation:

https://www.postgresql.org/docs/current/planner-stats.html

The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable. The default limit is presently 100 entries. Raising the
limit might allow more accurate planner estimates to be made,
particularly for columns with irregular data distributions, at the
price of consuming more space in pg_statistic and slightly more time
to compute the estimates. Conversely, a lower limit might be
sufficient for columns with simple data distributions.

I often have tables with a few common values and many rare values, or the other way around, so the right threshold will vary by table. For those who haven't used SET STATISTICS, it sets a per-column statistics target: the number of entries ANALYZE collects, which also scales the sample size. The default is 100, so 1000 should be higher fidelity. Here's what that looks like:

ALTER TABLE assembly_prods 
    ALTER COLUMN assembly_id
    SET STATISTICS 1000;
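
As an aside of my own: you can verify the setting took effect by checking pg_attribute (where -1 means "use default_statistics_target"), and it only influences statistics at the next ANALYZE:

-- Check the per-column statistics target.
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'assembly_prods'::regclass
  AND attname = 'assembly_id';

-- The new target takes effect at the next ANALYZE.
ANALYZE assembly_prods;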

You can use SET STATISTICS on a table column or on an index. Here's an interesting piece on indexes:

https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/

Note that the current documentation does now cover SET STATISTICS on indexes.
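
Out of curiosity, here's what that looks like on an expression index (the index itself is hypothetical; note that ALTER INDEX addresses the expression column by position):

-- Hypothetical expression index; the cast keeps this valid for any column type.
CREATE INDEX assembly_prods_lower_idx
    ON assembly_prods (lower(assembly_id::text));

-- Statistics are kept for the expression column, here column 1.
ALTER INDEX assembly_prods_lower_idx
    ALTER COLUMN 1 SET STATISTICS 1000;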

So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:

Target   Estimate  Difference  Missing
     1   13,657    14,431      51%
    10   13,867    14,221      51%
   100   13,759    14,329      51%
 1,000   24,746     3,342      12%
10,000   28,088         0       0%
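
For anyone who wants to reproduce the sweep, here's roughly how it can be scripted (my own sketch; note that n_distinct can come back negative as a ratio, which the answer below explains):

DO $$
DECLARE
    target int;
    est    numeric;
BEGIN
    FOREACH target IN ARRAY ARRAY[1, 10, 100, 1000, 10000] LOOP
        EXECUTE format(
            'ALTER TABLE assembly_prods ALTER COLUMN assembly_id SET STATISTICS %s',
            target);
        ANALYZE assembly_prods;  -- resample with the new target
        SELECT n_distinct INTO est
          FROM pg_stats
         WHERE tablename = 'assembly_prods'
           AND attname   = 'assembly_id';
        RAISE NOTICE 'target %: n_distinct = %', target, est;
    END LOOP;
END
$$;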

Obviously you can't draw general conclusions from a single case, but SET STATISTICS looks pretty darn useful, and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit across the board, as I suspect it would help in many cases in our system.

Best Answer

First, a remark: your query could be written more simply as

SELECT count(DISTINCT assembly_id) FROM assembly_prods;

Also, your statistics query is wrong, because n_distinct can also be negative: a positive value is an absolute count of distinct values, while a negative value is the ratio of distinct values to the row count, negated (used when ANALYZE believes the number of distinct values grows with the table). You should query:

SELECT CASE WHEN s.n_distinct < 0
            THEN - s.n_distinct * t.reltuples
            ELSE s.n_distinct
       END AS n_distinct
FROM pg_class t
   JOIN pg_namespace n ON n.oid = t.relnamespace
   JOIN pg_stats s ON t.relname = s.tablename
                      AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
  AND s.tablename = 'assembly_prods'
  AND s.attname = 'assembly_id';

For a simple query like that, the statistics should contain a good estimate.

If the estimates are off, try running ANALYZE on the table. That will also fix the results for a newly TRUNCATEd table: TRUNCATE does not cause PostgreSQL to autoanalyze the table (there may be room for improvement here).
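
In other words, after reloading a truncated table you need an explicit ANALYZE before trusting the statistics (a trivial sketch):

TRUNCATE assembly_prods;
-- ... reload the data ...
ANALYZE assembly_prods;  -- autovacuum will not do this for you after TRUNCATE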

If that improves the results, arrange for the table to be analyzed more often by configuring

ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);

It is also possible to set autovacuum_analyze_scale_factor to 0 and raise autovacuum_analyze_threshold to the daily change rate for the table.
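
That combination would look like this (the threshold value is a placeholder; set it to your table's actual daily change rate):

ALTER TABLE assembly_prods SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold    = 50000  -- placeholder: daily change rate
);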

If ANALYZE alone does not improve the estimate, increase the size of the sample:

ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;

A new ANALYZE should now produce better estimates.

Getting good n_distinct estimates becomes increasingly difficult for more complicated queries. Sometimes extended statistics will improve the estimate considerably.
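
For example, an extended statistics object of kind ndistinct on a column combination can help the planner estimate distinct counts over those columns together (group_id here is a made-up column for illustration):

-- Track the number of distinct (assembly_id, group_id) combinations.
CREATE STATISTICS assembly_prods_nd (ndistinct)
    ON assembly_id, group_id   -- group_id is hypothetical
    FROM assembly_prods;

ANALYZE assembly_prods;  -- populates the extended statistics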

As far as I know, PostgreSQL v12 does not bring any improvements in this area.