How to optimize min/max queries for large tables on PostgreSQL

optimization, performance, postgresql, postgresql-performance

How do you index a table in PostgreSQL so that min/max queries return as quickly as possible?

I have a large table with a couple hundred million rows. Each row has a source_id and the date when the record was last updated. I'd like to collect some statistics for each source_id, specifically the minimum and maximum last_updated_date for each source_id.
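For reference, a simplified sketch of the table (the column types and the id column are approximations; the real table has more columns):

CREATE TABLE mydata (
    id                 bigserial PRIMARY KEY,
    source_id          integer NOT NULL,
    last_updated_date  date NOT NULL
    -- plus a number of other columns not relevant to this question
);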

So I created this index on my table:

 CREATE INDEX CONCURRENTLY mydata_source_last_updated_date ON mydata (source_id, last_updated_date ASC);

However, when I try to query the min dates per source with:

SELECT source_id, MIN(last_updated_date) FROM mydata GROUP BY source_id;

the query takes about an hour to complete.

Is this normal performance for such a large table, even with an index? How can I reduce this query time?

Best Answer

With only a few dozen distinct values of source_id, you can get fast execution from the index you built by using a loose index scan, aka a "skip scan". Unfortunately, PostgreSQL does not plan those automatically, so you have to force it into one with a recursive query:

with recursive t as (
   -- seed with the smallest source_id, then repeatedly jump to the next
   -- larger one; each step is a single probe of the (source_id, ...) index
   select min(source_id) as col from mydata
   union all
   select (select min(source_id) from mydata where source_id > t.col)
   from t where t.col is not null
)
select
  col,
  (select min(last_updated_date) from mydata where source_id = col),
  (select max(last_updated_date) from mydata where source_id = col)
from t
where col is not null;  -- drop the trailing NULL row produced by the recursion
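What makes this fast is that every subquery above can be answered by a single descent of the (source_id, last_updated_date) index: the recursive part walks through the distinct source_id values one probe at a time, and the MIN/MAX subselects in the outer query each read one index entry per source_id. The total work is therefore roughly proportional to the number of distinct sources rather than the number of rows.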

Even if you don't resort to this, running the query as you originally wrote it should not take anywhere near an hour. But without seeing the EXPLAIN and EXPLAIN ANALYZE output, there isn't much more that can be said about that.
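For example, capturing the plan with something like this (BUFFERS is optional but usually worth including) would show whether the index is being used at all:

EXPLAIN (ANALYZE, BUFFERS)
SELECT source_id, MIN(last_updated_date)
FROM mydata
GROUP BY source_id;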