PostgreSQL – Understanding Default_Statistics_Target Value

Increasing the default_statistics_target value can make your database faster, especially after running ANALYZE.

I read the following in this article: https://discuss.pivotal.io/hc/en-us/articles/201581033-default-statistics-target-Explained

(…)in short and in basic term, this
parameter control the way the stats are collected, with value 1 being
the least estimated/accurate statistics and the value 1000 being the
most accurate statistics, obviously with the expense of time /
resources (CPU, memory etc) / space. Normally the default value is
sufficient to get a accurate plan, but if you have a complex data
distribution / or a column is referenced in the query quite often,
then setting a higher value might help in getting a better statistics
on the table and hence a better plan for the optimizer to execute.

It is a good explanation, but if I set default_statistics_target = 1000, for example, what does 1000 really mean? Is it 1000 kilobytes of statistics being generated? 1000 rows of each analyzed table? 1000 columns? Or perhaps 1000 seconds for each ANALYZE?

So my question is: how does this number actually affect ANALYZE and the query planner? Obviously, I understand that ANALYZE will take longer with default_statistics_target = 1000 than with 100, and that 1000 will generate better statistics.

Best Answer

ANALYZE will sample up to 300 * default_statistics_target rows from each table. It uses that sample to determine up to default_statistics_target most common values to store in one array (most_common_vals), and up to default_statistics_target histogram bounds to store in another (histogram_bounds). It also computes a few scalar statistics, such as the number of distinct values.
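The effect described above can be observed directly in the pg_stats view. The following is a minimal sketch, assuming a hypothetical scratch table named stats_demo with a single income column (neither is part of the original question):

```sql
-- Hypothetical scratch table, used only for illustration.
CREATE TEMP TABLE stats_demo AS
SELECT (random() * 100000)::integer AS income
FROM generate_series(1, 200000);

-- Session-level setting: ANALYZE may sample up to 300 * 100 = 30000 rows.
SET default_statistics_target = 100;
ANALYZE stats_demo;

-- Count the entries ANALYZE stored for the column.
-- pg_stats exposes these arrays as anyarray, hence the cast through text.
SELECT attname
     , n_distinct
     , array_length(most_common_vals::text::text[], 1) AS mcv_entries
     , array_length(histogram_bounds::text::text[], 1) AS histogram_entries
FROM pg_stats
WHERE tablename = 'stats_demo';
```

Both counts should stay at or near the target. The histogram may hold one more bound than the target, because the bounds delimit buckets, and for a small or low-cardinality column either array can be shorter or NULL. Repeating the experiment with a different target and comparing the counts shows what the number controls.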

The multiplier 300 was chosen because some statistical theory says that this is how many rows you need to sample for each histogram bound you wish to compute, in order for your sampled histogram bounds to have an acceptable level of uncertainty.

The most common value list is used to help the planner predict the selectivity of equality expressions, like where state='CA'. The histogram bounds are used to help the planner predict the selectivity of inequality or range expressions, like where income between 55000 and 64000.
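To see which statistic feeds which kind of estimate, one can compare the stored arrays with the row counts that EXPLAIN reports. This is only a sketch: the table people_demo and its data are made up for illustration, reusing the state and income columns from the examples above.

```sql
-- Hypothetical table with a skewed text column and a roughly uniform numeric column.
CREATE TEMP TABLE people_demo AS
SELECT CASE WHEN random() < 0.3 THEN 'CA'
            WHEN random() < 0.5 THEN 'NY'
            ELSE 'TX'
       END AS state
     , (random() * 100000)::integer AS income
FROM generate_series(1, 100000);

ANALYZE people_demo;

-- Equality predicates: the planner looks up the value in the most common value list.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'people_demo' AND attname = 'state';

EXPLAIN SELECT * FROM people_demo WHERE state = 'CA';

-- Range predicates: the planner interpolates between histogram bounds.
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'people_demo' AND attname = 'income';

EXPLAIN SELECT * FROM people_demo WHERE income BETWEEN 55000 AND 64000;
```

For the equality query, the estimated row count in the plan should be close to the frequency listed for 'CA' multiplied by the table's row count. For the range query, the estimate depends on how finely the histogram divides the income values.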

Related Solutions

Postgresql – Postgres Index scan forward vs backward = speed difference of 357X slower

Since I like replacing aggregate functions by old-fashioned self-joins and NOT EXISTS clauses, here is my attempt:

CREATE SCHEMA IF NOT EXISTS tmp;
SET search_path='tmp';

DROP TABLE IF EXISTS tmp.changes CASCADE;
CREATE TABLE tmp.changes
        ( id integer NOT NULL PRIMARY KEY
        , fullname varchar
        , issuer varchar
        , rsymbol varchar
        , industry varchar
        , activity INTEGER NOT NULL
        , shareschange FLOAT
        , sharespchange FLOAT
        , mfiled FLOAT
        );

        -- lacking information from the OP
        -- I can only presume a flat distribution.
INSERT INTO tmp.changes(id, activity, shareschange,sharespchange,mfiled )
SELECT nm.*                       -- id
        , (random() *20)::integer -- activity
        , random() *10000         -- shareschange
        , random() *100           -- sharespchange
        , random() *100000        -- mfiled
FROM generate_series(1,1000000) nm
        ;

ALTER TABLE tmp.changes
        ALTER shareschange
        SET STATISTICS 1000
        ;
ALTER TABLE tmp.changes
        ALTER mfiled
        SET STATISTICS 1000
        ;

VACUUM ANALYZE tmp.changes
        ;


CREATE INDEX changes_mfiled_shareschange
    ON tmp.changes(mfiled,shareschange)
        ;

EXPLAIN ANALYZE
SELECT initcap(ch.fullname) AS some_name1
     , initcap(ch.issuer) AS some_name2
     , upper(ch.rsymbol) AS some_name3
     , initcap(ch.industry) AS some_name4
     , ch.activity
     , to_char(ch.shareschange,'FM9,999,999,999,999,999') AS some_name5
     , ch.sharespchange || '%' AS some_name6
FROM   changes ch
WHERE  ch.activity IN (4,5)
        -- NOTE: the subquery is *not* correlated.
        -- [I had expected a subselect of nx.activity IN (4,5)
        -- like in the main query. ]
AND    NOT EXISTS (SELECT * FROM changes nx
        WHERE nx.mfiled > ch.mfiled
        )
ORDER  BY ch.shareschange ASC
LIMIT  15
        ;

PostgreSQL Optimization – ANALYZE Strategy for Big Tables

As already noted in the comments above, some details are missing. I understand from your question that the query plan changes after an ANALYZE. This may indicate that the statistics used by the query planner do not reflect the real distribution of the data.

In any case, ANALYZE takes only a sample; it does not examine the whole table. So tweaking autovacuum_analyze_threshold only makes sense to me if the new rows would dramatically change the distribution across the whole table. This depends on your use case.
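Before changing any threshold, it may help to check how stale the statistics actually are. The sketch below assumes a hypothetical table named analyze_demo standing in for your large table, and the override values are purely illustrative. Autoanalyze triggers once the number of changed rows exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * (number of rows in the table).

```sql
-- Hypothetical demo table; substitute your own large table.
CREATE TABLE IF NOT EXISTS analyze_demo (id integer, payload text);

-- When was each table last analyzed, and how many rows changed since then?
SELECT relname
     , n_live_tup
     , n_mod_since_analyze
     , last_analyze
     , last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 10;

-- Per-table override of the autoanalyze trigger (illustrative values only).
ALTER TABLE analyze_demo
    SET (autovacuum_analyze_threshold = 1000,
         autovacuum_analyze_scale_factor = 0.01);

-- Remove the demo table again.
DROP TABLE analyze_demo;
```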

Adjusting the size of the sample taken by ANALYZE seems much more important to me. You can influence the sample size for your table by setting the statistics target (unfortunately, the question does not mention it). This blog post shows how the statistics target influences the validity of the sample taken by ANALYZE.
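A rough way to see the sample size change is ANALYZE VERBOSE, which reports how many rows went into the sample. The table sample_demo below is a made-up example, not the table from the question:

```sql
-- Hypothetical table, large enough that the sample is smaller than the table.
CREATE TEMP TABLE sample_demo AS
SELECT g AS id
     , (random() * 100000)::integer AS income
FROM generate_series(1, 500000) AS g;

-- Raise the target for one column only; other columns keep the default.
ALTER TABLE sample_demo ALTER COLUMN income SET STATISTICS 1000;

-- VERBOSE prints an INFO line that includes the number of rows in the sample.
-- With a target of 1000 the sample may be up to 300 * 1000 = 300000 rows.
ANALYZE VERBOSE sample_demo;

-- Read back the per-column setting. Columns without an override show the
-- "use the default" marker (-1 on many versions, NULL on newer ones).
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'sample_demo'::regclass
  AND attnum > 0
  AND NOT attisdropped;
```

Setting the target per column rather than globally keeps ANALYZE cheap for the columns where the default statistics are already good enough.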
