PostgreSQL – Configuring Page Scan Limits for ANALYZE

configuration, postgresql

Is there a setting that controls how many records/pages are scanned by the PostgreSQL ANALYZE command? It looks like by default it scans 30,000 pages and 30,000 records. The PostgreSQL Server Configuration Documentation lists tons of options, but I don't see anything specific to ANALYZE.

Best Answer

Each column of a table has an attstattarget property (stored in pg_attribute) that tells how much data should be stored for it from the statistical sample gathered by ANALYZE.

It defaults to default_statistics_target, which itself defaults to 100.
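To see both values on a running server, something along these lines should work (my_table is a placeholder name; an attstattarget of -1, or NULL on recent releases, means the column falls back to default_statistics_target):

    SHOW default_statistics_target;   -- 100 unless it has been changed

    -- per-column overrides are stored in pg_attribute.attstattarget
    SELECT attname, attstattarget
    FROM   pg_attribute
    WHERE  attrelid = 'my_table'::regclass
      AND  attnum > 0
      AND  NOT attisdropped;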

In Statistics Used by the Planner, the doc says:

The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable.
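For example (table and column names below are placeholders), a per-column override and a session-wide change would look roughly like this; note that the new target only takes effect at the next ANALYZE:

    -- raise the target for one column only
    ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;

    -- or change the default for the current session
    -- (ALTER SYSTEM / postgresql.conf for the whole cluster)
    SET default_statistics_target = 500;

    ANALYZE my_table;   -- re-gather statistics with the new target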

The reason behind the 30,000 pages and rows is that the sample size, in rows, considered by ANALYZE is 300 times the largest attstattarget among the sampled table's columns, which with the default is 100: 300 × 100 = 30,000 rows.
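If the table is large enough, this is easy to observe with ANALYZE VERBOSE, which reports the sample it took; with the default target it should mention a 30,000-row sample read from at most 30,000 pages. The output is roughly of this shape (numbers purely illustrative, my_table is a placeholder):

    ANALYZE VERBOSE my_table;
    -- INFO:  analyzing "public.my_table"
    -- INFO:  "my_table": scanned 30000 of 44248 pages, containing 6780000 live rows
    --        and 0 dead rows; 30000 rows in sample, 10000000 estimated total rows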

The 300 comes from a statistical formula mentioned in the source code
src/backend/commands/analyze.c:

    /*--------------------
     * The following choice of minrows is based on the paper
     * "Random sampling for histogram construction: how much is enough?"
     * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
     * Proceedings of ACM SIGMOD International Conference on Management
     * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
     * says that for table size n, histogram size k, maximum relative
     * error in bin size f, and error probability gamma, the minimum
     * random sample size is
     *      r = 4 * k * ln(2*n/gamma) / f^2
     * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
     *      r = 305.82 * k
     * Note that because of the log function, the dependence on n is
     * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
     * bin size error with probability 0.99.  So there's no real need to
     * scale for n, which is a good thing because we don't necessarily
     * know it at this point.
     *--------------------
     */
    stats->minrows = 300 * attr->attstattarget;

As for the number of pages: since rows don't span pages, at most N pages need to be read to obtain N rows. I believe ANALYZE intentionally aims at fetching that maximum number of pages to get the best possible sample, which makes sense since rows stored in the same page are more likely to be correlated.
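One way to sanity-check this is to compare the table's physical size in pg_class with that ceiling (again, my_table is a placeholder):

    SELECT relname, relpages, reltuples
    FROM   pg_class
    WHERE  relname = 'my_table';
    -- with the default target, a table with relpages <= 30000 is read entirely,
    -- while a bigger one still has only ~30000 of its pages visited by ANALYZE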