Postgresql – Storing the groupwise most recent related record

optimizationperformancepostgresqlquery-performance

I have two tables, customers and purchases. There are a lot (thousands) of purchases per customer. I usually only need the most recent purchase for each customer, which is why I have the "latest_purchase_id" column and update it whenever I add a purchase.

I'd rather not have to maintain "latest_purchase_id", so I've been testing queries. They've all ended up being much, much slower and I'm not sure why.

Customers:

       Column        |  Type    |                       Modifiers                        | Storage  | Stats target | Description
---------------------+----------+--------------------------------------------------------+----------+--------------+-------------
 id                  | integer  | not null default nextval('customers_id_seq'::regclass) | plain    |              |
 latest_purchase_id  | integer  |                                                        | plain    |              |
Indexes:
    "customers_pkey" PRIMARY KEY, btree (id)
    "customers_latest_purchase_id" btree (latest_purchase_id)
Foreign-key constraints:
    "customers_latest_purchase_fk" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "purchases" CONSTRAINT "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

Purchases:

     Column   |  Type     |                        Modifiers                       | Storage  | Stats target | Description
--------------+-----------+--------------------------------------------------------+----------+--------------+-------------
 id           | integer   | not null default nextval('purchases_id_seq'::regclass) | plain    |              |
 customer_id  | integer   |                                                        | plain    |              |
Indexes:
    "purchases_pkey" PRIMARY KEY, btree (id)
    "purchases_id_customer_id" btree (id, customer_id)
    "purchases_customer_id" btree (customer_id)
Foreign-key constraints:
    "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "customers" CONSTRAINT "customers_latest_purchase_id" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

SELECT customers.id, purchases.id 
FROM customers 
   JOIN purchases ON customers.latest_purchase_id = purchases.id;

48ms

SELECT DISTINCT ON (customer_id) id, customer_id
FROM purchases
ORDER BY customer_id, id DESC;

1040ms

SELECT customers.id, p.id
FROM customers INNER JOIN (
    SELECT RANK()
    OVER (PARTITION BY customer_id ORDER BY id DESC) r, *
    FROM purchases
) p
ON customers.id = p.customer_id
WHERE p.r = 1;

836ms

SELECT customers.id, p1.id
FROM customers
JOIN purchases p1 ON customers.id = p1.customer_id
LEFT OUTER JOIN purchases p2 ON (customers.id = p2.customer_id and p1.id < p2.id)
WHERE p2.id IS NULL;

1833ms

SELECT customers.id, p.id
FROM customers CROSS JOIN LATERIAL (
    SELECT purchases.id, purchases.customer_id
    FROM purchases
    WHERE purchases.customer_id = customers.id
    ORDER BY purchases.id DESC
    LIMIT 1
) p;

23442ms

As you can see, "latest_purchase_id" is way faster than anything else. The performance benefit is obviously a tradeoff, since purchase insertions will take around twice as long (I improved this significantly with a trigger below). The query is also limited to strictly the most recent purchases. No changing the query on the fly to match the most recent purchases over a certain transaction value.

Is there a reason the other queries are so slow, even with the indexes I have set up? I essentially just need to find the max purchase ID for each customer ID, which the "purchases_id_customer_id" index should be able to handle easily.

Here's the explain analyze output for the first two queries:

EXPLAIN ANALYZE SELECT customers.id, purchases.id FROM customers JOIN purchases ON customers.latest_purchase_id = purchases.id;
 Nested Loop  (cost=0.42..11643.46 rows=3422 width=8) (actual time=0.961..72.014 rows=340 loops=1)
   ->  Seq Scan on customers  (cost=0.00..93.22 rows=3422 width=8) (actual time=0.010..1.239 rows=3420 loops=1)
   ->  Index Only Scan using purchases_pkey on purchases  (cost=0.42..3.38 rows=1 width=4) (actual time=0.020..0.020 rows=0 loops=3420)
         Index Cond: (id = d.latest_purchase_id)
         Heap Fetches: 137
 Planning Time: 0.681 ms
 Execution Time: 72.134 ms

EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id DESC;
 Unique  (cost=78791.68..81715.56 rows=157 width=8) (actual time=1092.279..1434.771 rows=407 loops=1)
   ->  Sort  (cost=78791.68..80253.62 rows=584777 width=8) (actual time=1092.277..1291.642 rows=585790 loops=1)
         Sort Key: customer_id, id DESC
         Sort Method: external merge  Disk: 8304kB
         ->  Seq Scan on purchases  (cost=0.00..14779.77 rows=584777 width=8) (actual time=0.736..610.967 rows=585790 loops=1)
 Planning Time: 0.098 ms
 Execution Time: 1436.267 ms

Edit: I corrected the index to (customer_id, id), but it's still quite slow. It's more data now, so the times aren't completely comparable, but it's still not close to the trigger method.

EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id;
 Result  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.513..1461.147 rows=823 loops=1)
   ->  Unique  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.510..1460.719 rows=823 loops=1)
         ->  Index Only Scan using purchases_customer_id_id_idx on purchases  (cost=0.43..157859.86 rows=1866267 width=8) (actual time=0.508..981.186 rows=1866213 loops=1)
               Heap Fetches: 1363609
 Planning Time: 0.096 ms
 Execution Time: 1461.359 ms
(6 rows)

Best Answer

After looking some more, I discovered sql triggers and figured out how to update latest_purchase_id with one. That takes away a lot of the hassle and performance loss during insertions, but I'm still not sure why the other queries are performing so poorly.

CREATE OR REPLACE FUNCTION latest_purchase_func() RETURNS trigger AS
$BODY$
DECLARE
    CustomerID INT;
    PurchaseID INT;
BEGIN
    SELECT n.id, n.customer_id INTO PurchaseID, CustomerID
        FROM new_table n ORDER BY n.id DESC LIMIT 1;
    UPDATE customers SET "latest_purchase_id" = PurchaseID
        WHERE "customers"."id" = CustomerID;
    RETURN NULL;
END
$BODY$
LANGUAGE plpgsql;

CREATE trigger latest_purchase_ins
AFTER INSERT ON purchases
REFERENCING NEW TABLE AS new_table
FOR EACH STATEMENT
EXECUTE FUNCTION latest_purchase_func();

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

I see a couple of issues.

The biggest one is that PG is using a sequence scan on A when filtering A. I think you need a composite index on A.flag AND A.strvalue. If there is already an index available, PostgreSQL is choosing not to use it for some reason. This seems to be eating up 92% of your cost estimate and is likely what's making it run for so long.

As for the ILIKE, PostgreSQL cannot natively (but see below for a module that can) use an index as long as your wildcard is the first character. That's simply a restriction on the ILIKE operator. For that reason you are getting a sequence scan which means every single row is being loaded and the C.name column is being scanned for characters. But one thing that's weird is that the ILIKE sequence scan doesn't seem to be eating up much of the cost estimate in this query plan. Anyway, if it is the ILIKE operator causing the slowdown, I would consider rewriting your query so that it somehow looks like this: ILIKE 'value%' or else consider using PostgreSQL's full text search.

UPDATED

The ILIKE operator can use a trigram index. Superb!

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

pgAdmin timing

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Related Question