I have a table customers
with customerid SERIAL
as its primary key. Initially I had the following statistics for that column:
attname     n_distinct  correlation
customerid  -1          -0.393365
The following query
EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid < 10
produced the following analyzed plan:
Index Scan using customers_pkey on customers (cost=0.00..35.46 rows=9 width=268) (actual time=0.003..0.010 rows=9 loops=1)
Index Cond: (customerid < 10)
Total runtime: 0.029 ms
Note that the cost to get all rows is 35.46. Then I ran
CLUSTER customers USING customers_pkey;
ANALYZE;
and the correlation
became equal to 1.
After running the query once again I got the following analyzed plan:
Index Scan using customers_pkey on customers (cost=0.00..8.41 rows=9 width=268) (actual time=0.003..0.005 rows=9 loops=1)
Index Cond: (customerid < 10)
Total runtime: 0.024 ms
Note that the cost has decreased more than fourfold (to 8.41).
My question is: how exactly does the cost of a ranged index scan
depend on the correlation? How is it computed? It would be good if you could point me to some references in the PostgreSQL documentation.
Best Answer
The details you want will not be found in the documentation, but rather in the source code; in particular, in the function
cost_index
in
src/backend/optimizer/path/costsize.c.
It computes
min_IO_cost
on the assumption that the table is perfectly correlated with the index,
max_IO_cost
on the assumption that it is completely uncorrelated, and then interpolates between the two based on the correlation.
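To make the interpolation concrete, here is a minimal sketch in Python. It assumes the weighting scheme used in costsize.c, where the interpolation weight is the *square* of the index correlation; the min/max cost values passed in below are made-up illustrative numbers, not real planner figures.

```python
def interpolate_io_cost(min_io_cost: float,
                        max_io_cost: float,
                        correlation: float) -> float:
    """Sketch of the interpolation done in cost_index (costsize.c):
    weight the best-case cost by the square of the correlation, so
    only correlations near +/-1 pull the estimate toward min_IO_cost."""
    csquared = correlation * correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

# With a weak correlation (as before CLUSTER), the estimate stays
# close to the uncorrelated worst case:
print(interpolate_io_cost(1.0, 35.0, -0.393365))

# With correlation = 1 (as after CLUSTER), the estimate collapses
# to the best case:
print(interpolate_io_cost(1.0, 35.0, 1.0))
```

Because the weight is squared, even a moderate correlation like -0.39 moves the estimate only about 15% of the way from the worst case toward the best case, which matches the large cost drop you saw once CLUSTER pushed the correlation to 1.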