PostgreSQL – Understanding Default_Statistics_Target Value

postgresql

Increasing the default_statistics_target value can make your database faster, specially after analyze….

Reading this article I see that https://discuss.pivotal.io/hc/en-us/articles/201581033-default-statistics-target-Explained

(…)in short and in basic term, this
parameter control the way the stats are collected , with value 1 being
the least estimated/accurate statistics and the value 1000 being the
most accurate statistics , obviously with the expense of time /
resources ( CPU , memory etc ) / space . Normally the default value is
sufficient to get a accurate plan , but if you have a complex data
distribution / or a column is referenced in the query quite often ,
then setting a higher value might help in getting a better statistics
on the table and hence a better plan for the optimizer to execute.

It is a good explanation, but for example if I set default_statistics_target= 1000 what 1000 really means? It is 1000 kilobytes of statistics being generated? or maybe it is 1000 rows of the tables analyzed? Maybe it is 1000 columns? or perhaps 1000 seconds for each analyze…

So my question is how this number is really affecting the analyze or the query planner? Obvious I understand that default_statistics_target = 1000 will get more time than 100, for running analyze, and that 1000 will generate better statistics…

Best Answer

It will sample 300 * default_statistics_target rows from each table. It will use that sample to determine upto default_statistics_target most common values to store in that array, and upto default_statistics_target histogram bounds to store in that array. Plus a few other scalar statistics, like the number of distinct values.

The multiplier 300 was chosen because some statistical theory says that is how many you need to sample per each histogram bound you wish to compute, in order for your sampled histogram bounds to have an acceptable level of uncertainty.

The most common value list is used to help the planner predict the selectivity of equality expressions, like where state='CA'. The histogram bounds are used to help the planner predict the selectivity of inequality or range expressions, like where income between 55000 and 64000