PostgreSQL 9.4
I have a table called customers
which has a column income
integer. After runnig ANALYZE
against it I got the following statistic for the column:
most_common_vals
{20000,80000,40000,60000,100000}
Now, I run the following simple query EXPLAIN ANALYZE SELECT * FROM customers WHERE income=123123
to understand the rows count estimating. Output:
Seq Scan on customers (cost=0.00..738.00 rows=1 width=268) (actual time=4.669..4.669 rows=0 loops=1)
Filter: (income = 123123)
Rows Removed by Filter: 20000
Since the optimizer doesn't have statistic for the income
value 123123
, it made a wild guess for 0.5%
of the table size. So, the estimated row count should have been 500
. But the optimizer returned the 1
. Why? Maybe I didn't understand the estimating process of row counting of unknown values?
Couldn't you explain it a bit?
Best Answer
Since it was not in most_common_vals,
PostgreSQL
then looks in the histogram. If there is no histogram, it would conclude the most common values are the only values present in the table, and therefore the estimate for 123123 is zero. But it doesn't allow zero estimates in most places, to prevent div by zero error, and instead clamps it at 1.0.5% is for cases where there is no information, like for a generic query plan or a join where the value won't be known at planning time.
But having a list of MCV and the one you want not being present in that list and there being no histogram to fallback on does constitute information.