PostgreSQL – Estimate average and median efficiently in Postgres

postgresql scalability statistics

I have a Postgres database with tables containing billions of rows, so aggregate functions such as count() and avg(), as well as "order by random()", are very time-consuming. Postgres has pg_catalog, which contains lots of useful statistics describing a database (such as the histogram bins in the view pg_stats). Is there a way to take advantage of the statistics in pg_catalog to estimate the average and median of a numeric column in a Postgres table?

Best Answer

If an estimate is good enough, then statistical sampling is your friend. I'd probably use a sample size calculator to determine how many rows I need, then write some code to insert that many randomly chosen keys into a table. Join that key table to the big table, run the aggregate function over the result, and you're done.
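For the sample size step, here is a minimal Python sketch of the textbook formula for estimating a mean, n0 = (z * sigma / E)^2, with a finite population correction. The function name required_sample_size and the example numbers (standard deviation 50, margin of error 1) are made up for illustration. You'd need your own rough guess of the column's standard deviation, for example from a small pilot sample.

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, margin_of_error, confidence=0.95, population=None):
    """Rows needed to estimate a mean to within +/- margin_of_error.

    std_dev is an assumed (guessed or pilot-sampled) standard deviation
    of the column; it is an input to the formula, not something computed here.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n = (z * std_dev / margin_of_error) ** 2
    if population is not None:
        # Finite population correction; negligible for billion-row tables.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: std dev of 50, want the mean within +/- 1 at 95%.
    print(required_sample_size(50, 1))
    print(required_sample_size(50, 1, population=10**9))
```

With these made-up inputs the answer comes out just under ten thousand rows, and it is the same whether or not you tell the function the table has a billion rows.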

If you've never done anything like this before, you'll probably want to do some background reading. When I had to do that stuff, I used a handbook from nist.gov. (And you'll probably be surprised at how small a sample you need.)
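To get a feel for how far a small sample goes, here is a rough simulation in Python. It does not touch Postgres at all: the "column" is synthetic log-normal data, and the function name, sizes, and seed are arbitrary choices for illustration. It draws a simple random sample and compares the sample mean and median with the full-data values.

```python
import random
import statistics


def compare_sample_to_population(population_size=500_000, sample_size=10_000, seed=42):
    """Return (relative error of mean, relative error of median) for one sample."""
    rng = random.Random(seed)
    # Synthetic, skewed stand-in for a numeric column (assumption: log-normal values).
    population = [rng.lognormvariate(3.0, 0.5) for _ in range(population_size)]
    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)

    true_mean = statistics.fmean(population)
    true_median = statistics.median(population)
    mean_error = abs(statistics.fmean(sample) - true_mean) / true_mean
    median_error = abs(statistics.median(sample) - true_median) / true_median
    return mean_error, median_error


if __name__ == "__main__":
    mean_error, median_error = compare_sample_to_population()
    print(f"relative error of mean:   {mean_error:.4%}")
    print(f"relative error of median: {median_error:.4%}")
```

The precision of the estimate depends mainly on the number of rows sampled, not on the fraction of the table they represent, which is why the same few thousand rows can serve a table of almost any size.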