Postgresql – Estimate average and median efficiently in Postgres

postgresql | scalability | statistics

I have a Postgres database with tables at the billion-row scale, so aggregate functions such as count() and avg(), as well as "order by random()", are very time-consuming. Postgres has pg_catalog, which contains lots of useful statistics describing the database (such as the histogram bins in the pg_stats view). Is there a way to take advantage of the statistics in pg_catalog to estimate the average and median of a numeric column in a Postgres table?
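For context on what the planner statistics can offer: pg_stats stores equi-depth histogram bounds for each analyzed column, so the middle bound is a rough median estimate. A minimal sketch, assuming a hypothetical table `measurements` with a numeric column `value` (and that ANALYZE has run recently — these are estimates from a sample, not exact values):

```sql
-- Inspect the equi-depth histogram the planner keeps for the column.
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'measurements' AND attname = 'value';

-- Because the buckets are equi-depth, the middle bound approximates
-- the median. histogram_bounds has type anyarray, so cast via text.
SELECT bounds[array_length(bounds, 1) / 2 + 1] AS approx_median
FROM (
  SELECT histogram_bounds::text::numeric[] AS bounds
  FROM pg_stats
  WHERE tablename = 'measurements' AND attname = 'value'
) s;
```

Note that pg_stats does not store a column mean directly, and the accuracy of this estimate is limited by `default_statistics_target` and how stale the statistics are.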
Best Answer
If an estimate is good enough, then statistical sampling is your friend. I'd probably use a sample size calculator to determine how many rows I need, then write some code to insert that many randomly chosen keys into a table. A join, an aggregate function, and you're done.
If you've never done anything like this before, you'll probably want to do some background reading. When I had to do that stuff, I used a handbook from nist.gov. (And you'll probably be surprised at how small a sample you need.)
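On modern Postgres (9.5+) you can skip the key table entirely and sample in the query itself with TABLESAMPLE. A sketch, again assuming the hypothetical `measurements(value)` table:

```sql
-- BERNOULLI samples rows uniformly (at the cost of scanning more blocks);
-- SYSTEM samples whole blocks and is faster but can be less uniform.
-- The argument is a percentage of the table, here ~0.01% of rows.
SELECT avg(value) AS approx_avg,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS approx_median
FROM measurements TABLESAMPLE BERNOULLI (0.01);
```

Add `REPEATABLE (seed)` after the TABLESAMPLE clause if you need the same sample across runs.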