I have a table customers
with customerid SERIAL
as its primary key. Initially I had the following statistics for that column:
attname     n_distinct  correlation
customerid  -1          -0.393365
The following query
EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid < 10
produced the following analyzed plan:
Index Scan using customers_pkey on customers (cost=0.00..35.46 rows=9 width=268) (actual time=0.003..0.010 rows=9 loops=1)
Index Cond: (customerid < 10)
Total runtime: 0.029 ms
Note that the cost to get all rows is 35.46. Then I ran
CLUSTER customers USING customers_pkey;
ANALYZE;
and the correlation
became equal to 1.
After running the query once again I got the following analyzed plan:
Index Scan using customers_pkey on customers (cost=0.00..8.41 rows=9 width=268) (actual time=0.003..0.005 rows=9 loops=1)
Index Cond: (customerid < 10)
Total runtime: 0.024 ms
Note that the cost has decreased more than fourfold (to 8.41).
My question is: how exactly does the cost of a ranged index scan
depend on the correlation? How is it computed? It would be good if you could point me to some references in the PostgreSQL documentation.
Best Answer
The details you want will not be found in the documentation, but rather in the source code; in particular, in the function
cost_index
in
src/backend/optimizer/path/costsize.c.
It computes
min_IO_cost
on the assumption that the table is perfectly correlated with the index,
max_IO_cost
on the assumption that it is completely uncorrelated, and then interpolates between the two based on the correlation.
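To make the interpolation concrete, here is a minimal sketch in Python. It assumes the weighting scheme used in costsize.c, where the interpolation weight is the *square* of the index correlation; the min/max cost values passed in below are made-up illustrative numbers, not real planner figures.

```python
def interpolate_io_cost(min_io_cost: float,
                        max_io_cost: float,
                        correlation: float) -> float:
    """Sketch of the interpolation done in cost_index (costsize.c):
    weight the best-case cost by the square of the correlation, so
    only correlations near +/-1 pull the estimate toward min_IO_cost."""
    csquared = correlation * correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

# With a weak correlation (as before CLUSTER), the estimate stays
# close to the uncorrelated worst case:
print(interpolate_io_cost(1.0, 35.0, -0.393365))

# With correlation = 1 (as after CLUSTER), the estimate collapses
# to the best case:
print(interpolate_io_cost(1.0, 35.0, 1.0))
```

Because the weight is squared, even a moderate correlation like -0.39 moves the estimate only about 15% of the way from the worst case toward the best case, which matches the large cost drop you saw once CLUSTER pushed the correlation to 1.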