Postgres Performance over GROUP BY with MAX and MIN

Tags: performance, postgresql-9.2, postgresql-9.3, query-performance

I want to retrieve, for the customers of a specific location, the first and last purchases they ever made in the system.

Table details:

CREATE TABLE customer_location ( id UUID, location_id UUID, customer_id UUID );
CREATE TABLE customer_receipt  ( id UUID, receipt_id UUID, customer_id UUID );
CREATE TABLE receipt_product ( id UUID, receipt_id UUID, product_id UUID, txn_dt_tm Timestamp, status TEXT, is_active boolean);

Query:

SELECT cr.customer_id, min(txn_dt_tm), max(txn_dt_tm) 
FROM customer_location cl 
JOIN customer_receipt cr ON cl.customer_id = cr.customer_id AND cl.location_id = 'UUID here'
JOIN receipt_product rp ON rp.receipt_id = cr.receipt_id AND status = 'PURCHASED' 
    AND is_active = true 
GROUP BY cr.customer_id

Indexes:

receipt_product: an index on (status, is_active).

The other tables only have their primary-key indexes.
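
For illustration, such an index would be created like this (the index name here is just illustrative):

CREATE INDEX receipt_product_status_active_idx
  ON receipt_product (status, is_active);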

However, the query planner performs a sequential scan rather than using the index. I suspect an index scan would be too heavy a load at this row count, which is why the planner picked the sequential scan. The query takes approximately 36 seconds, which is a huge load time.

Here is the query plan: http://explain.depesz.com/s/5hV

HashAggregate  (cost=1023419.340..1023425.980 rows=664 width=24) (actual time=42811.045..42811.152 rows=142 loops=1)
  ->  Hash Right Join  (cost=464863.230..1011575.600 rows=1579165 width=24) (actual time=17306.921..41755.433 rows=1101749 loops=1)
          Hash Cond: (receipt_product.receipt_id = customer_receipt.receipt_id)
        ->  Seq Scan on receipt_product  (cost=0.000..405877.220 rows=6981890 width=24) (actual time=0.051..14179.158 rows=6890896 loops=1)
                Filter: (is_active AND (status = 'PURCHASED'::text))
                Rows Removed by Filter: 2909925
        ->  Hash  (cost=434327.670..434327.670 rows=1579165 width=32) (actual time=17306.658..17306.658 rows=1101749 loops=1)
                Buckets: 65536  Batches: 4  Memory Usage: 17257kB
              ->  Hash Right Join  (cost=44.660..434327.670 rows=1579165 width=32) (actual time=2345.366..16229.130 rows=1101749 loops=1)
                      Hash Cond: (customer_receipt.customer_id = customer_location.customer_id)
                      Join Filter: (customer_location.location_id = 'location_id'::uuid)
                      Rows Removed by Join Filter: 7182859
                    ->  Seq Scan on customer_receipt  (cost=0.000..299670.480 rows=8283848 width=32) (actual time=0.004..7416.387 rows=8283831 loops=1)
                    ->  Hash  (cost=33.180..33.180 rows=918 width=32) (actual time=1.551..1.551 rows=918 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 58kB
                          ->  Seq Scan on customer_location  (cost=0.000..33.180 rows=918 width=32) (actual time=0.006..0.759 rows=918 loops=1)

I rewrote the same query as a Postgres function that iterates over every customer of the location and fetches the max and min purchase date for each one. It performs about 6 times faster (approximately 6 seconds), but as the data grows it will become slow again.
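
The function itself isn't reproduced here; a minimal sketch of the per-customer loop approach, with an assumed name and signature, looks like this:

CREATE OR REPLACE FUNCTION location_purchase_range(p_location_id uuid)
RETURNS TABLE (customer_id uuid, first_purchase timestamp, last_purchase timestamp)
LANGUAGE plpgsql AS
$$
DECLARE
    cust RECORD;
BEGIN
    -- Loop over every customer registered at the given location
    FOR cust IN
        SELECT cl.customer_id
        FROM customer_location cl
        WHERE cl.location_id = p_location_id
    LOOP
        customer_id := cust.customer_id;
        -- Fetch that customer's earliest and latest purchase timestamps
        SELECT min(rp.txn_dt_tm), max(rp.txn_dt_tm)
          INTO first_purchase, last_purchase
        FROM customer_receipt cr
        JOIN receipt_product rp ON rp.receipt_id = cr.receipt_id
        WHERE cr.customer_id = cust.customer_id
          AND rp.status = 'PURCHASED'
          AND rp.is_active;
        RETURN NEXT;
    END LOOP;
END;
$$;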

I strongly hope to find the flaw here, and I would also like to have the experts' suggestions.

UPDATE

I tried disabling the sequential scan; it still takes the same amount of time with index scans, so the query planner had already picked the right way to execute. What are the possibilities for reducing the execution time here? Are there any settings that would improve execution speed?
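
For reference, the scan toggle is session-local; the test was along these lines:

SET enable_seqscan = off;   -- discourage sequential scans for this session only
EXPLAIN ANALYZE
SELECT cr.customer_id, min(txn_dt_tm), max(txn_dt_tm)
FROM customer_location cl
JOIN customer_receipt cr ON cl.customer_id = cr.customer_id AND cl.location_id = 'UUID here'
JOIN receipt_product rp ON rp.receipt_id = cr.receipt_id AND status = 'PURCHASED'
    AND is_active = true
GROUP BY cr.customer_id;
RESET enable_seqscan;       -- restore the default planner behaviour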

I am also adding the query-tuning parameters from my database:

cpu_index_tuple_cost        0.005
cpu_operator_cost           0.0025
cpu_tuple_cost              0.01
effective_cache_size        464300
random_page_cost            1
seq_page_cost               1
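
For reference, the same parameters can be read from pg_settings with a query like this:

SELECT name, setting
FROM pg_settings
WHERE name IN ('cpu_index_tuple_cost', 'cpu_operator_cost', 'cpu_tuple_cost',
               'effective_cache_size', 'random_page_cost', 'seq_page_cost');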

Thank you.

Best Answer

The following indexes may help you:

CREATE INDEX stkx_custloc_ind1
  ON customer_location
  USING btree
  (location_id, customer_id);

CREATE INDEX stkx_custrecpt_ind1
  ON customer_receipt
  USING btree
  (customer_id, receipt_id);

CREATE INDEX stkx_recptprod_ind1
  ON receipt_product
  USING btree
  (receipt_id, status COLLATE pg_catalog."default", is_active, txn_dt_tm);

CREATE INDEX stkx_recptprod_ind2
  ON receipt_product
  USING btree
  (receipt_id, txn_dt_tm, status COLLATE pg_catalog."default", is_active);

Please note, in addition, that your query may be better tuned with an inline view, which aggregates receipt_product down to one row per receipt before it is joined:

SELECT
    cr.customer_id,
    min(min_txn_dt_tm),
    max(max_txn_dt_tm)
FROM customer_location cl
JOIN customer_receipt cr ON cl.customer_id = cr.customer_id AND cl.location_id = 'UUID HERE'
JOIN (
    SELECT re_pr.receipt_id,
           min(txn_dt_tm) AS min_txn_dt_tm,
           max(txn_dt_tm) AS max_txn_dt_tm
    FROM receipt_product re_pr
    WHERE status = 'PURCHASED'
      AND is_active = true
    GROUP BY re_pr.receipt_id
) rp ON rp.receipt_id = cr.receipt_id
GROUP BY cr.customer_id;

Your schema suffers from the 'id' antipattern. It makes a programmer's life easier, right up until the application allows unexpected data to be entered, such as multiple customers having the same receipt.
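
As an illustration, natural-key constraints along these lines (the constraint names are placeholders) would prevent that:

-- Each receipt may belong to only one customer
ALTER TABLE customer_receipt
  ADD CONSTRAINT customer_receipt_receipt_uq UNIQUE (receipt_id);

-- A customer may appear only once per location
ALTER TABLE customer_location
  ADD CONSTRAINT customer_location_loc_cust_uq UNIQUE (location_id, customer_id);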

I hope this helps!