Postgresql – Displaying a value associated with a maximum value

aggregategreatest-n-per-groupmaxpostgresql

I have a PostgreSQL database for a supermarket (it's a toy problem) and I need to find which outlet has sold the most copies of each product and display it in a query which also shows the product name, description, copies in stock across all stores, copies purchased across all stores.

I think I have the first few columns queried appropriately, as below:

SELECT a.ProductName, a.ProductDescription, a.StockSum, b.PurchaseSum, c.MaxSales
FROM (SELECT Product.Name AS ProductName, Product.Description AS ProductDescription
           , SUM(Stock.copies) AS StockSum
FROM Product
INNER JOIN Stock
ON Stock.product_id = Product.product_id
GROUP BY Product.name, Product.description) AS a
FULL JOIN
(SELECT Product.name AS ProductName, Product.description AS ProductDescription
      , SUM(PurchaseItem.copies) AS PurchaseSum
FROM Product
INNER JOIN PurchaseItem
ON PurchaseItem.product_id = Product.product_id
GROUP BY Product.name, Product.description) AS b;

But I cannot for the life of me work out how to pull the appropriate Outlet.name associated with the MAX of the SUM of all PurchaseItem.copies corresponding to a particular product_id at a particular store. It seems like a very complex query, and it has flummoxed me!

The structure of the database is that Purchase references Outlet, while PurchaseItem references Purchase and a single Product (with PurchaseItem.copies recording the number of sales of that product in the purchase).

Best Answer

You could use a window function here, but I would think there's actually a better solution with DISTINCT ON.

First I simplified what you had so far:

SELECT p.name AS product_name, p.description AS product_description
     , a.stock_sum, b.purchase_sum
     , c.max_sales, o.outlet_name   -- still missing
FROM   Product p
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS stock_sum
   FROM   Stock
   GROUP  BY 1
   ) a USING (product_id)
LEFT   JOIN (
   SELECT product_id, sum(copies) AS purchase_sum
   FROM   PurchaseItem
   GROUP  BY 1
   ) b USING (product_id)
-- c, o still  missing

It should be considerably faster to aggregate counts before joining:

Also, LEFT JOIN retains products in the result that don't have any purchases, yet, or aren't in stock any more.

Then add the missing parts:

LEFT   JOIN (
   SELECT DISTINCT ON (product_id)
          pi.product_id, pu.outlet_id, sum(copies) AS max_sales
   FROM   Purchase      pu
   JOIN   PurchaseItem  pi USING (purchase_id)
   GROUP  BY 1, 2
   ORDER  BY 1, sum(copies) DESC NULLS LAST
   ) c  USING (product_id)
LEFT   JOIN Outlet o USING (outlet_id);

About DISTINCT ON:

You can run DISTINCT over the results of the aggregation. Consider the sequence of events in a query:

Optimize performance

It's probably cheaper to scan PurchaseItem only once, using a CTE. But this also adds some overhead. You'll have to test which is faster:

WITH ct AS (
   SELECT pi.product_id, pu.outlet_id, sum(pi.copies) AS sales
   FROM   PurchaseItem  pi 
   JOIN   Purchase      pu USING (purchase_id)
   GROUP  BY 1, 2
   )
SELECT p.name AS product_name, p.description AS product_description
     , a.stock_sum, b.purchase_sum
     , c.max_sales, o.outlet_name
FROM   Product p
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS stock_sum
   FROM   Stock
   GROUP  BY 1
   ) a USING (product_id)
LEFT   JOIN (
   SELECT product_id, sum(sales) AS purchase_sum
   FROM   ct
   GROUP  BY 1
   ) b USING (product_id)
LEFT   JOIN (
   SELECT DISTINCT ON (product_id)
          product_id, outlet_id, sales AS max_sales
   FROM   ct
   ORDER  BY product_id, sales DESC
   ) c  USING (product_id)
LEFT   JOIN Outlet o USING (outlet_id);

Test performance with EXPLAIN ANALYZE (a couple of times to exclude caching effects).