PostgreSQL – Configuring Page Scan Limits for ANALYZE

configuration, postgresql

Is there a setting that controls how many records/pages are scanned by the PostgreSQL ANALYZE command? It looks like by default it scans 30,000 pages and 30,000 records. The PostgreSQL Server Configuration Documentation lists tons of options, but I don't see anything specific to ANALYZE.

Best Answer

Each column of a table has an attstattarget property (stored in pg_attribute) that tells how much data should be stored for it from the statistical sample gathered by ANALYZE.

It defaults to default_statistics_target, which itself defaults to 100.
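To see both values on a running server, something along these lines should work (my_table is a placeholder name; an attstattarget of -1, or NULL on recent releases, means the column falls back to default_statistics_target):

    SHOW default_statistics_target;   -- 100 unless it has been changed

    -- per-column overrides are stored in pg_attribute.attstattarget
    SELECT attname, attstattarget
    FROM   pg_attribute
    WHERE  attrelid = 'my_table'::regclass
      AND  attnum > 0
      AND  NOT attisdropped;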

In Statistics Used by the Planner, the doc says:

The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable.
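For example (table and column names below are placeholders), a per-column override and a session-wide change would look roughly like this; note that the new target only takes effect at the next ANALYZE:

    -- raise the target for one column only
    ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;

    -- or change the default for the current session
    -- (ALTER SYSTEM / postgresql.conf for the whole cluster)
    SET default_statistics_target = 500;

    ANALYZE my_table;   -- re-gather statistics with the new target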

The reason behind the 30,000 pages and rows is that the sample size, in rows, considered by ANALYZE is 300 times the largest attstattarget among the sampled table's columns, which with the default is 100: 300 × 100 = 30,000 rows.
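If the table is large enough, this is easy to observe with ANALYZE VERBOSE, which reports the sample it took; with the default target it should mention a 30,000-row sample read from at most 30,000 pages. The output is roughly of this shape (numbers purely illustrative, my_table is a placeholder):

    ANALYZE VERBOSE my_table;
    -- INFO:  analyzing "public.my_table"
    -- INFO:  "my_table": scanned 30000 of 44248 pages, containing 6780000 live rows
    --        and 0 dead rows; 30000 rows in sample, 10000000 estimated total rows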

The 300 comes from a statistical formula mentioned in the source code
src/backend/commands/analyze.c:

    /*--------------------
     * The following choice of minrows is based on the paper
     * "Random sampling for histogram construction: how much is enough?"
     * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
     * Proceedings of ACM SIGMOD International Conference on Management
     * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
     * says that for table size n, histogram size k, maximum relative
     * error in bin size f, and error probability gamma, the minimum
     * random sample size is
     *      r = 4 * k * ln(2*n/gamma) / f^2
     * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
     *      r = 305.82 * k
     * Note that because of the log function, the dependence on n is
     * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
     * bin size error with probability 0.99.  So there's no real need to
     * scale for n, which is a good thing because we don't necessarily
     * know it at this point.
     *--------------------
     */
    stats->minrows = 300 * attr->attstattarget;

As for the number of pages: since rows don't span pages, at most N pages need to be read to obtain N rows. I believe ANALYZE intentionally aims at fetching that maximum number of pages to get the best possible sample, which makes sense since rows stored in the same page are more likely to be correlated.
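One way to sanity-check this is to compare the table's physical size in pg_class with that ceiling (again, my_table is a placeholder):

    SELECT relname, relpages, reltuples
    FROM   pg_class
    WHERE  relname = 'my_table';
    -- with the default target, a table with relpages <= 30000 is read entirely,
    -- while a bigger one still has only ~30000 of its pages visited by ANALYZE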