Postgresql – Estimate average and median efficiently in Postgres

postgresql scalability statistics

I have a Postgres database with tables containing billions of rows, so aggregate functions such as count() and avg(), as well as "order by random()", are very time-consuming. Postgres has pg_catalog, which contains lots of useful statistics describing a database (such as the histogram bins in the view pg_stats). Is there a way to take advantage of the statistics in pg_catalog to estimate the average and median of a numeric column in a Postgres table?

Best Answer

If an estimate is good enough, then statistical sampling is your friend. I'd probably use a sample size calculator to determine how many rows I need, then write some code to insert that many randomly chosen keys into a table. Join that table to the big one, apply the aggregate function, and you're done.
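The key-table approach described above might look something like this in Postgres. It is only a sketch under assumptions of mine, not the answerer's code: big_table, its integer primary key id, the numeric column amount, and the sample size of 10,000 are all hypothetical, and percentile_cont needs Postgres 9.4 or later.

```sql
-- Hypothetical schema: big_table(id bigint PRIMARY KEY, amount numeric).
-- Assumes ids lie somewhere in 1..max(id); random ids that land in a gap
-- simply find no match, so the sample comes out a little smaller.

-- 1. Draw random candidate keys. max(id) is cheap on an indexed key.
CREATE TEMP TABLE sample_keys AS
SELECT DISTINCT (1 + floor(random() * m.max_id))::bigint AS id
FROM (SELECT max(id) AS max_id FROM big_table) AS m
CROSS JOIN generate_series(1, 10000);

-- 2. Join the keys back to the big table and aggregate only those rows.
SELECT count(*)      AS rows_sampled,
       avg(b.amount) AS estimated_avg,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY b.amount) AS estimated_median
FROM sample_keys AS s
JOIN big_table   AS b USING (id);
```

On Postgres 9.5 or later, a query such as `SELECT avg(amount) FROM big_table TABLESAMPLE SYSTEM (0.01)` may be a simpler alternative, though SYSTEM samples whole pages, so the rows it returns are clustered rather than independently chosen.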

If you've never done anything like this before, you'll probably want to do some background reading. When I had to do that stuff, I used a handbook from nist.gov. (And you'll probably be surprised at how small a sample you need.)
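To illustrate why the sample can be so small: for estimating a mean, the usual textbook formula is n = (z * sigma / E)^2, which does not involve the table size at all as long as the table is much larger than the sample. The inputs below are made up for illustration: 95% confidence (z = 1.96), a guessed standard deviation of 50, and a tolerated error of plus or minus 1.

```sql
-- Rows needed to estimate a mean to within +/- margin at 95% confidence.
-- sigma is an assumed standard deviation, e.g. taken from a small pilot sample.
SELECT ceil(power(z * sigma / margin, 2)) AS rows_needed
FROM (VALUES (1.96, 50.0, 1.0)) AS p(z, sigma, margin);
```

With these inputs the formula works out to 9,604 rows, whether the table holds a million rows or a billion.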

Related Solutions

Postgresql – Postgres scalability – What is the impact of connection pooling

I think you're seeing a false dichotomy.

It can be useful to have connection pooling in place even if you expect a 1:1 mapping of clients to back-ends. If your connections are long-lived, you won't benefit from reducing backend setup/teardown overhead, as it's small and amortized across a long period. A pool like PgBouncer may remain useful for other reasons:

It can block until a connection is available rather than returning an immediate max_connections exceeded error;
It lets you switch the pool's target server when you fail over to a standby, without having to reconfigure applications;
It lets you limit application database workers to fewer than max_connections, so you can still make reporting and maintenance connections as a non-superuser.

Additionally, if suitable for your application you can use transaction-level pooling to greatly increase the number of clients that can be served by your server.

I would not try to keep a strict 1:1 mapping from Apache workers to PostgreSQL workers, personally. If you've got (say) 16 cores and good I/O on your PostgreSQL box you might want something like 16-20 active PostgreSQL workers for optimal performance. You're almost certain to want more Apache workers than that, since they'll be kept busy by things like:

Persistent HTTP connections from idle clients;
Unresponsive or very slow clients;
Intentional DoS connections;
Network interruptions between client and server; etc.

If possible, consider a transaction-pooling design with short-lived transactions instead.
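One way to judge how many backends are actually working rather than sitting idle is to look at pg_stat_activity. The query below is a rough sketch, not part of the original answer. It assumes Postgres 9.2 or later, where the state column exists. On newer versions, rows with a NULL state are background processes rather than client connections.

```sql
-- Count backends by state and show the configured connection limit alongside.
SELECT state,
       count(*) AS backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
GROUP BY state
ORDER BY backends DESC;
```

If most backends show up as idle while only a handful are active, that suggests a pool smaller than the number of Apache workers would serve the same load.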

Mysql – Find the average number of n:m connections of two tables

You only need the counts from the 2 tables and a division:

SELECT 
    (SELECT COUNT(*) FROM Lists_Users)  /  (SELECT COUNT(*) FROM `User`) 
    AS average_lists_per_user ;

If you want more statistical information, you can write it using derived tables:

SELECT 
    total_users, active_users, total_assignments, 
    total_assignments / total_users                  -- this is
        AS average_lists_per_user,                   -- what you want
    total_assignments / active_users 
        AS average_lists_per_active_user
FROM
    ( SELECT COUNT(*) AS total_assignments,
             COUNT(DISTINCT user_id) AS active_users
      FROM Lists_Users
    ) AS lu
  CROSS JOIN
    ( SELECT COUNT(*) AS total_users 
      FROM `User`
    ) AS u ;
