How to optimize min/max queries for large tables on PostgreSQL

optimization, performance, postgresql, postgresql-performance

How do you index a table in PostgreSQL so that min/max queries return as quickly as possible?

I have a large table with a couple hundred million rows. Each row has a source_id and the date when the record was last updated. I'd like to collect some statistics for each source_id, specifically the minimum and maximum last_updated_date for each source_id.
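For reference, a simplified sketch of the table (the column types and the id column are approximations; the real table has more columns):

CREATE TABLE mydata (
    id                 bigserial PRIMARY KEY,
    source_id          integer NOT NULL,
    last_updated_date  date NOT NULL
    -- plus a number of other columns not relevant to this question
);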

So I created this index on my table:

 CREATE INDEX CONCURRENTLY mydata_source_last_updated_date ON mydata (source_id, last_updated_date ASC);

However, when I try to query the min dates per source with:

SELECT source_id, MIN(last_updated_date) FROM mydata GROUP BY source_id;

the query takes about an hour to complete.

Is this normal performance for such a large table, even with an index? How can I reduce this query time?

Best Answer

With only a few dozen distinct values of source_id, you can get fast execution from the index you built by using a loose index scan, aka a "skip scan". Unfortunately, PostgreSQL does not plan those automatically, so you have to force it into one with a recursive query:

with recursive t as (
   -- seed with the smallest source_id, then repeatedly jump to the next
   -- larger one; each step is a single probe of the (source_id, ...) index
   select min(source_id) as col from mydata
   union all
   select (select min(source_id) from mydata where source_id > t.col)
   from t where t.col is not null
)
select
  col,
  (select min(last_updated_date) from mydata where source_id = col),
  (select max(last_updated_date) from mydata where source_id = col)
from t
where col is not null;  -- drop the trailing NULL row produced by the recursion
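What makes this fast is that every subquery above can be answered by a single descent of the (source_id, last_updated_date) index: the recursive part walks through the distinct source_id values one probe at a time, and the MIN/MAX subselects in the outer query each read one index entry per source_id. The total work is therefore roughly proportional to the number of distinct sources rather than the number of rows.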

Even if you don't resort to this, running the query as you originally wrote it should not take anywhere near an hour. But without seeing the EXPLAIN and EXPLAIN ANALYZE output, there isn't much more that can be said about that.
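For example, capturing the plan with something like this (BUFFERS is optional but usually worth including) would show whether the index is being used at all:

EXPLAIN (ANALYZE, BUFFERS)
SELECT source_id, MIN(last_updated_date)
FROM mydata
GROUP BY source_id;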