Postgresql – Understanding rows count computation

postgresqlstatistics

PostgreSQL 9.4

I have a table called customers which has a column income integer. After runnig ANALYZE against it I got the following statistic for the column:

       most_common_vals
{20000,80000,40000,60000,100000}

Now, I run the following simple query EXPLAIN ANALYZE SELECT * FROM customers WHERE income=123123 to understand the rows count estimating. Output:

Seq Scan on customers  (cost=0.00..738.00 rows=1 width=268) (actual time=4.669..4.669 rows=0 loops=1)
  Filter: (income = 123123)
  Rows Removed by Filter: 20000

Since the optimizer doesn't have statistic for the income value 123123, it made a wild guess for 0.5% of the table size. So, the estimated row count should have been 500. But the optimizer returned the 1. Why? Maybe I didn't understand the estimating process of row counting of unknown values?

Couldn't you explain it a bit?

Best Answer

Since it was not in most_common_vals, PostgreSQL then looks in the histogram. If there is no histogram, it would conclude the most common values are the only values present in the table, and therefore the estimate for 123123 is zero. But it doesn't allow zero estimates in most places, to prevent div by zero error, and instead clamps it at 1.

0.5% is for cases where there is no information, like for a generic query plan or a join where the value won't be known at planning time.

But having a list of MCV and the one you want not being present in that list and there being no histogram to fallback on does constitute information.