PostgreSQL – How to Query Results from a Huge Database

Tags: count, distinct, performance, postgresql, postgresql-performance

I am trying to query a few tables in a 125GB Postgres database, and I am having a hard time getting the results back quickly!

Even finding the unique values of a single column takes more than 10 minutes. Whether I run the query from the command line or from a tool like PHPMyAdmin, the wait time is the same.

For example:

SELECT COUNT (DISTINCT column_name) FROM table_name;

I am worried that if I use join operations on these tables, the wait time will increase significantly and become even more painful.

Could anyone suggest a possible solution for getting these statistics back quickly?

Best Answer

SELECT COUNT (DISTINCT column_name) FROM table_name;

The part of the PostgreSQL code that implements COUNT(DISTINCT ...) is quite old and hasn't had much performance work done on it recently. It cannot take advantage of parallel processing or hash tables, for example. You could rewrite the query with a subquery, which may be able to use some of the newer techniques:

SELECT COUNT(*) FROM
    (SELECT DISTINCT column_name FROM table_name WHERE column_name IS NOT NULL) AS foo;
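
The WHERE column_name IS NOT NULL filter keeps the rewrite equivalent to the original, since COUNT(DISTINCT ...) ignores NULLs, while COUNT(*) over the subquery would otherwise count a NULL group. To check whether the planner actually picks a parallel or hash-based plan for the rewritten form, you can look at its plan with EXPLAIN (same placeholder table and column names as above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM
    (SELECT DISTINCT column_name FROM table_name WHERE column_name IS NOT NULL) AS foo;

Nodes such as Parallel Seq Scan or HashAggregate in the output indicate that the rewrite is using the newer machinery that COUNT(DISTINCT ...) cannot.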

But even then, this is not as trivial a task as you seem to think: counting distinct values is inherently hard work and may take a long time when fed huge amounts of data.

I am worried that if I use join operations on these tables, the wait time will increase significantly and become even more painful.

Don't invent things to worry about. The performance of COUNT(DISTINCT ...) tells you very little about the performance of a JOIN. Write the query you actually want, then worry about optimizing that query, not some other query that has little to do with it.
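
For example, if what you ultimately need is a join, profile that join directly. This is only a sketch; the table names, join key, and filter below are hypothetical:

EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.id, t2.some_value
FROM table_one AS t1
JOIN table_two AS t2 ON t2.table_one_id = t1.id
WHERE t1.created_at >= '2023-01-01';

EXPLAIN (ANALYZE, BUFFERS) runs the query and reports where the time actually goes, which tells you far more about your join than the timing of an unrelated aggregate.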

It can make sense to simplify a query to see how long component parts of it might take. But COUNT(DISTINCT...) is unlikely to be such a simplifying component part.
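
As a sketch of that idea (again with hypothetical names), in psql you can turn on \timing and run the component parts separately, starting with the cheapest:

\timing on

-- Time the scan-and-filter part on its own first.
SELECT COUNT(*) FROM table_one WHERE created_at >= '2023-01-01';

-- Then add the join to measure its incremental cost.
SELECT COUNT(*)
FROM table_one AS t1
JOIN table_two AS t2 ON t2.table_one_id = t1.id
WHERE t1.created_at >= '2023-01-01';

Comparing the two timings shows how much the join itself adds, which is the kind of component-level measurement that is actually useful.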