Postgresql – Storing the groupwise most recent related record

optimizationperformancepostgresqlquery-performance

I have two tables, customers and purchases. There are a lot (thousands) of purchases per customer. I usually only need the most recent purchase for each customer, which is why I have the "latest_purchase_id" column and update it whenever I add a purchase.

I'd rather not have to maintain "latest_purchase_id", so I've been testing queries. They've all ended up being much, much slower and I'm not sure why.

Customers:

       Column        |  Type    |                       Modifiers                        | Storage  | Stats target | Description
---------------------+----------+--------------------------------------------------------+----------+--------------+-------------
 id                  | integer  | not null default nextval('customers_id_seq'::regclass) | plain    |              |
 latest_purchase_id  | integer  |                                                        | plain    |              |
Indexes:
    "customers_pkey" PRIMARY KEY, btree (id)
    "customers_latest_purchase_id" btree (latest_purchase_id)
Foreign-key constraints:
    "customers_latest_purchase_fk" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "purchases" CONSTRAINT "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

Purchases:

     Column   |  Type     |                        Modifiers                       | Storage  | Stats target | Description
--------------+-----------+--------------------------------------------------------+----------+--------------+-------------
 id           | integer   | not null default nextval('purchases_id_seq'::regclass) | plain    |              |
 customer_id  | integer   |                                                        | plain    |              |
Indexes:
    "purchases_pkey" PRIMARY KEY, btree (id)
    "purchases_id_customer_id" btree (id, customer_id)
    "purchases_customer_id" btree (customer_id)
Foreign-key constraints:
    "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "customers" CONSTRAINT "customers_latest_purchase_id" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no
SELECT customers.id, purchases.id 
FROM customers 
   JOIN purchases ON customers.latest_purchase_id = purchases.id;

48ms

SELECT DISTINCT ON (customer_id) id, customer_id
FROM purchases
ORDER BY customer_id, id DESC;

1040ms

SELECT customers.id, p.id
FROM customers INNER JOIN (
    SELECT RANK()
    OVER (PARTITION BY customer_id ORDER BY id DESC) r, *
    FROM purchases
) p
ON customers.id = p.customer_id
WHERE p.r = 1;

836ms

SELECT customers.id, p1.id
FROM customers
JOIN purchases p1 ON customers.id = p1.customer_id
LEFT OUTER JOIN purchases p2 ON (customers.id = p2.customer_id and p1.id < p2.id)
WHERE p2.id IS NULL;

1833ms

SELECT customers.id, p.id
FROM customers CROSS JOIN LATERIAL (
    SELECT purchases.id, purchases.customer_id
    FROM purchases
    WHERE purchases.customer_id = customers.id
    ORDER BY purchases.id DESC
    LIMIT 1
) p;

23442ms

As you can see, "latest_purchase_id" is way faster than anything else. The performance benefit is obviously a tradeoff, since purchase insertions will take around twice as long (I improved this significantly with a trigger below). The query is also limited to strictly the most recent purchases. No changing the query on the fly to match the most recent purchases over a certain transaction value.

Is there a reason the other queries are so slow, even with the indexes I have set up? I essentially just need to find the max purchase ID for each customer ID, which the "purchases_id_customer_id" index should be able to handle easily.

Here's the explain analyze output for the first two queries:

EXPLAIN ANALYZE SELECT customers.id, purchases.id FROM customers JOIN purchases ON customers.latest_purchase_id = purchases.id;
 Nested Loop  (cost=0.42..11643.46 rows=3422 width=8) (actual time=0.961..72.014 rows=340 loops=1)
   ->  Seq Scan on customers  (cost=0.00..93.22 rows=3422 width=8) (actual time=0.010..1.239 rows=3420 loops=1)
   ->  Index Only Scan using purchases_pkey on purchases  (cost=0.42..3.38 rows=1 width=4) (actual time=0.020..0.020 rows=0 loops=3420)
         Index Cond: (id = d.latest_purchase_id)
         Heap Fetches: 137
 Planning Time: 0.681 ms
 Execution Time: 72.134 ms
EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id DESC;
 Unique  (cost=78791.68..81715.56 rows=157 width=8) (actual time=1092.279..1434.771 rows=407 loops=1)
   ->  Sort  (cost=78791.68..80253.62 rows=584777 width=8) (actual time=1092.277..1291.642 rows=585790 loops=1)
         Sort Key: customer_id, id DESC
         Sort Method: external merge  Disk: 8304kB
         ->  Seq Scan on purchases  (cost=0.00..14779.77 rows=584777 width=8) (actual time=0.736..610.967 rows=585790 loops=1)
 Planning Time: 0.098 ms
 Execution Time: 1436.267 ms

Edit: I corrected the index to (customer_id, id), but it's still quite slow. It's more data now, so the times aren't completely comparable, but it's still not close to the trigger method.

EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id;
 Result  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.513..1461.147 rows=823 loops=1)
   ->  Unique  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.510..1460.719 rows=823 loops=1)
         ->  Index Only Scan using purchases_customer_id_id_idx on purchases  (cost=0.43..157859.86 rows=1866267 width=8) (actual time=0.508..981.186 rows=1866213 loops=1)
               Heap Fetches: 1363609
 Planning Time: 0.096 ms
 Execution Time: 1461.359 ms
(6 rows)

Best Answer

After looking some more, I discovered sql triggers and figured out how to update latest_purchase_id with one. That takes away a lot of the hassle and performance loss during insertions, but I'm still not sure why the other queries are performing so poorly.

CREATE OR REPLACE FUNCTION latest_purchase_func() RETURNS trigger AS
$BODY$
DECLARE
    CustomerID INT;
    PurchaseID INT;
BEGIN
    SELECT n.id, n.customer_id INTO PurchaseID, CustomerID
        FROM new_table n ORDER BY n.id DESC LIMIT 1;
    UPDATE customers SET "latest_purchase_id" = PurchaseID
        WHERE "customers"."id" = CustomerID;
    RETURN NULL;
END
$BODY$
LANGUAGE plpgsql;

CREATE trigger latest_purchase_ins
AFTER INSERT ON purchases
REFERENCING NEW TABLE AS new_table
FOR EACH STATEMENT
EXECUTE FUNCTION latest_purchase_func();