PostgreSQL – Improving distinct-value estimates in Postgres

Tags: cardinality-estimates, distinct, postgresql

Full counts in Postgres can be slow, for reasons that are well understood and much discussed. So I've been using estimation techniques instead, where possible. For row counts, pg_stats seems fine; for views, extracting the estimate returned by EXPLAIN works okay.

https://www.cybertec-postgresql.com/en/count-made-fast/
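
Roughly, the EXPLAIN trick I'm using looks like this (a sketch, not production code; count_estimate is my own helper name):

-- Ask the planner for its row estimate instead of counting.
CREATE OR REPLACE FUNCTION count_estimate(query text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    plan json;
BEGIN
    -- EXPLAIN (FORMAT JSON) returns a one-element JSON array
    -- whose 'Plan' object carries the estimated row count.
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Usage:
SELECT count_estimate('SELECT * FROM assembly_prods');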

But what about distinct values? Here I've had a lot less luck. Sometimes the estimates are 100% correct; sometimes they're off by a factor of 2 or 20. Truncated tables, in particular, seem to have badly stale estimates.

I just ran this test, with the results below:

analyze assembly_prods; -- Run ANALYZE first to give pg_stats every chance.

select 'count(*) distinct' as method,
        count(*) as count
from (select distinct assembly_id 
      from assembly_prods) d 
union all
select 'n_distinct from pg_stats' as method,
        n_distinct as count
from pg_stats 
where tablename  = 'assembly_prods' and
      attname    = 'assembly_id';

The results:

method                      count
count(*) distinct           28088
n_distinct from pg_stats    13805

That's only off by a factor of 2, but I've seen much worse in my data, to the point where I won't use the estimates. Is there something else I can try? Is this something that PG 12 improves?

Follow-up

I hadn't ever experimented with SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz's answer, I've taken a quick look. Here's a useful passage from the documentation:

https://www.postgresql.org/docs/current/planner-stats.html

The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable. The default limit is presently 100 entries. Raising the
limit might allow more accurate planner estimates to be made,
particularly for columns with irregular data distributions, at the
price of consuming more space in pg_statistic and slightly more time
to compute the estimates. Conversely, a lower limit might be
sufficient for columns with simple data distributions.

I often have tables with a few common values and many rare values, or the other way around, so the right threshold will vary by table. For those who haven't used SET STATISTICS, it sets a per-column statistics target: the number of entries ANALYZE collects, which also scales the sample size. The default is 100, so 1000 should be higher fidelity. Here's what that looks like:

ALTER TABLE assembly_prods 
    ALTER COLUMN assembly_id
    SET STATISTICS 1000;
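
As an aside of my own: you can verify the setting took effect by checking pg_attribute (where -1 means "use default_statistics_target"), and it only influences statistics at the next ANALYZE:

-- Check the per-column statistics target.
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'assembly_prods'::regclass
  AND attname = 'assembly_id';

-- The new target takes effect at the next ANALYZE.
ANALYZE assembly_prods;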

You can use SET STATISTICS on a table column or on an index. Here's an interesting piece on indexes:

https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/

Note that the current documentation does now cover SET STATISTICS on indexes.
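
Out of curiosity, here's what that looks like on an expression index (the index itself is hypothetical; note that ALTER INDEX addresses the expression column by position):

-- Hypothetical expression index; the cast keeps this valid for any column type.
CREATE INDEX assembly_prods_lower_idx
    ON assembly_prods (lower(assembly_id::text));

-- Statistics are kept for the expression column, here column 1.
ALTER INDEX assembly_prods_lower_idx
    ALTER COLUMN 1 SET STATISTICS 1000;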

So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:

Target   Estimate  Difference  Missing
     1   13,657    14,431      51%
    10   13,867    14,221      51%
   100   13,759    14,329      51%
 1,000   24,746     3,342      12%
10,000   28,088         0       0%
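
For anyone who wants to reproduce the sweep, here's roughly how it can be scripted (my own sketch; note that n_distinct can come back negative as a ratio, which the answer below explains):

DO $$
DECLARE
    target int;
    est    numeric;
BEGIN
    FOREACH target IN ARRAY ARRAY[1, 10, 100, 1000, 10000] LOOP
        EXECUTE format(
            'ALTER TABLE assembly_prods ALTER COLUMN assembly_id SET STATISTICS %s',
            target);
        ANALYZE assembly_prods;  -- resample with the new target
        SELECT n_distinct INTO est
          FROM pg_stats
         WHERE tablename = 'assembly_prods'
           AND attname   = 'assembly_id';
        RAISE NOTICE 'target %: n_distinct = %', target, est;
    END LOOP;
END
$$;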

Obviously you can't draw general conclusions from a single case, but SET STATISTICS looks pretty darn useful, and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit across the board, as I suspect it would help in many cases in our system.

Best Answer

First, a remark: your query could be written more simply as

SELECT count(DISTINCT assembly_id) FROM assembly_prods;

Also, your statistics query is wrong, because n_distinct can also be negative: a positive value is an absolute count of distinct values, while a negative value is the ratio of distinct values to the row count, negated (used when ANALYZE believes the number of distinct values grows with the table). You should query:

SELECT CASE WHEN s.n_distinct < 0
            THEN - s.n_distinct * t.reltuples
            ELSE s.n_distinct
       END AS n_distinct
FROM pg_class t
   JOIN pg_namespace n ON n.oid = t.relnamespace
   JOIN pg_stats s ON t.relname = s.tablename
                      AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
  AND s.tablename = 'assembly_prods'
  AND s.attname = 'assembly_id';

For a simple query like that, the statistics should contain a good estimate.

If the estimates are off, try running ANALYZE on the table. That will also fix the results for a newly TRUNCATEd table: TRUNCATE does not cause PostgreSQL to autoanalyze the table (there may be room for improvement here).
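
In other words, after reloading a truncated table you need an explicit ANALYZE before trusting the statistics (a trivial sketch):

TRUNCATE assembly_prods;
-- ... reload the data ...
ANALYZE assembly_prods;  -- autovacuum will not do this for you after TRUNCATE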

If that improves the results, arrange for the table to be analyzed more often by configuring

ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);

It is also possible to set autovacuum_analyze_scale_factor to 0 and raise autovacuum_analyze_threshold to the daily change rate for the table.
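
That combination would look like this (the threshold value is a placeholder; set it to your table's actual daily change rate):

ALTER TABLE assembly_prods SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold    = 50000  -- placeholder: daily change rate
);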

If ANALYZE alone does not improve the estimate, increase the size of the sample:

ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;

A new ANALYZE should now produce better estimates.

Getting good n_distinct estimates becomes increasingly difficult for more complicated queries. Sometimes extended statistics will improve the estimate considerably.
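
For example, an extended statistics object of kind ndistinct on a column combination can help the planner estimate distinct counts over those columns together (group_id here is a made-up column for illustration):

-- Track the number of distinct (assembly_id, group_id) combinations.
CREATE STATISTICS assembly_prods_nd (ndistinct)
    ON assembly_id, group_id   -- group_id is hypothetical
    FROM assembly_prods;

ANALYZE assembly_prods;  -- populates the extended statistics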

As far as I know, PostgreSQL v12 does not bring any improvements in this area.