PostgreSQL – Displaying a value associated with a maximum value

Tags: aggregate, greatest-n-per-group, max, postgresql

I have a PostgreSQL database for a supermarket (it's a toy problem). I need to find which outlet has sold the most copies of each product and display it in a query that also shows the product name, description, copies in stock across all stores, and copies purchased across all stores.

I think I have the first few columns queried appropriately, as below:

SELECT a.ProductName, a.ProductDescription, a.StockSum, b.PurchaseSum
     -- , c.MaxSales   -- missing: this is the part I cannot work out
FROM (SELECT Product.name AS ProductName, Product.description AS ProductDescription
           , SUM(Stock.copies) AS StockSum
      FROM Product
      INNER JOIN Stock
      ON Stock.product_id = Product.product_id
      GROUP BY Product.name, Product.description) AS a
FULL JOIN
     (SELECT Product.name AS ProductName, Product.description AS ProductDescription
           , SUM(PurchaseItem.copies) AS PurchaseSum
      FROM Product
      INNER JOIN PurchaseItem
      ON PurchaseItem.product_id = Product.product_id
      GROUP BY Product.name, Product.description) AS b
ON b.ProductName = a.ProductName
AND b.ProductDescription = a.ProductDescription;

But I cannot for the life of me work out how to pull the appropriate Outlet.name associated with the MAX of the SUM of all PurchaseItem.copies corresponding to a particular product_id at a particular store. It seems like a very complex query, and it has flummoxed me!

The structure of the database is that Purchase references Outlet, while PurchaseItem references Purchase and a single Product (with PurchaseItem.copies recording the number of sales of that product in the purchase).

Best Answer

You could use a window function here, but I would think there's actually a better solution with DISTINCT ON.

First I simplified what you had so far:

SELECT p.name AS product_name, p.description AS product_description
     , a.stock_sum, b.purchase_sum
     , c.max_sales, o.name AS outlet_name   -- c and o are still missing
FROM   Product p
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS stock_sum
   FROM   Stock
   GROUP  BY 1
   ) a USING (product_id)
LEFT   JOIN (
   SELECT product_id, sum(copies) AS purchase_sum
   FROM   PurchaseItem
   GROUP  BY 1
   ) b USING (product_id)
-- the joins for c and o are still missing

It should be considerably faster to aggregate the sums before joining. See the related question "Why does the following join increase the query time significantly?"

Also, LEFT JOIN retains products in the result that don't have any purchases yet or aren't in stock any more.
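Both points (aggregate first, and LEFT JOIN keeps every product) can be tried on a toy data set. The following is a minimal sketch and not part of the original answer: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a server, and the table contents are invented for illustration. The SQL uses only standard features and should behave the same way on PostgreSQL. The second query joins first and aggregates afterwards; with this sample data it also inflates the sums, because every stock row is paired with every purchase row of the same product.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product      (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Stock        (product_id INTEGER, copies INTEGER);
CREATE TABLE PurchaseItem (product_id INTEGER, copies INTEGER);
-- hypothetical sample data: 'bread' has no purchases, 'cider' has neither
INSERT INTO Product      VALUES (1, 'apple'), (2, 'bread'), (3, 'cider');
INSERT INTO Stock        VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO PurchaseItem VALUES (1, 2), (1, 3), (1, 4);
""")

# Aggregate each child table on its own, then join one row per product.
aggregated_first = con.execute("""
SELECT p.name, a.stock_sum, b.purchase_sum
FROM   Product p
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS stock_sum
   FROM   Stock
   GROUP  BY product_id
   ) a USING (product_id)
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS purchase_sum
   FROM   PurchaseItem
   GROUP  BY product_id
   ) b USING (product_id)
ORDER  BY p.product_id
""").fetchall()

# Join everything first, aggregate afterwards: rows multiply before SUM runs.
joined_first = con.execute("""
SELECT p.name, SUM(s.copies), SUM(pi.copies)
FROM   Product p
LEFT   JOIN Stock        s  USING (product_id)
LEFT   JOIN PurchaseItem pi USING (product_id)
GROUP  BY p.product_id, p.name
ORDER  BY p.product_id
""").fetchall()

print(aggregated_first)
print(joined_first)
```

With this data the first query reports 15 in stock and 9 purchased for 'apple' and keeps 'bread' and 'cider' with NULL (None) where nothing was found. The second query reports 45 and 18 for 'apple', since 2 stock rows times 3 purchase rows produce 6 joined rows.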

Then add the missing parts:

LEFT   JOIN (
   SELECT DISTINCT ON (product_id)
          pi.product_id, pu.outlet_id, sum(copies) AS max_sales
   FROM   Purchase      pu
   JOIN   PurchaseItem  pi USING (purchase_id)
   GROUP  BY 1, 2
   ORDER  BY 1, sum(copies) DESC NULLS LAST
   ) c  USING (product_id)
LEFT   JOIN Outlet o USING (outlet_id);
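The answer mentions that a window function would also work here. As a hedged sketch of that alternative (not the answer's recommended DISTINCT ON form), the top outlet per product can be picked with row_number(). The example runs on SQLite through Python's sqlite3 module purely so that it is self-contained; the sample rows are invented, and the tie-breaker on outlet_id is an assumption added to make the result deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Purchase     (purchase_id INTEGER PRIMARY KEY, outlet_id INTEGER);
CREATE TABLE PurchaseItem (purchase_id INTEGER, product_id INTEGER, copies INTEGER);
-- hypothetical sample data
INSERT INTO Purchase VALUES (1, 10), (2, 10), (3, 20), (4, 20);
INSERT INTO PurchaseItem VALUES
   (1, 1, 2), (2, 1, 3),    -- product 1: outlet 10 sells 5 in total
   (3, 1, 4),               -- product 1: outlet 20 sells 4 in total
   (3, 2, 1), (4, 2, 6),    -- product 2: outlet 20 sells 7 in total
   (1, 2, 5);               -- product 2: outlet 10 sells 5 in total
""")

top_outlets = con.execute("""
WITH ct AS (                       -- sales per (product, outlet)
   SELECT pi.product_id, pu.outlet_id, SUM(pi.copies) AS sales
   FROM   PurchaseItem pi
   JOIN   Purchase     pu USING (purchase_id)
   GROUP  BY pi.product_id, pu.outlet_id
   )
SELECT product_id, outlet_id, sales AS max_sales
FROM  (
   SELECT ct.*
        , row_number() OVER (PARTITION BY product_id
                             ORDER BY sales DESC, outlet_id) AS rn
   FROM   ct
   ) ranked
WHERE  rn = 1                      -- keep only the best outlet per product
ORDER  BY product_id
""").fetchall()

print(top_outlets)
```

For the sample rows this yields outlet 10 (5 copies) for product 1 and outlet 20 (7 copies) for product 2. Replacing row_number() with rank() would keep all tied outlets instead of picking one.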

About DISTINCT ON, see "Select first row in each GROUP BY group?"

You can apply DISTINCT ON to the results of the aggregation, because it is evaluated after GROUP BY and the aggregate functions. For the sequence of events in a query, see "Best way to get result count before LIMIT was applied in PHP/PostgreSQL".

Optimize performance

It's probably cheaper to scan PurchaseItem only once, using a CTE. But this also adds some overhead. You'll have to test which is faster:

WITH ct AS (
   SELECT pi.product_id, pu.outlet_id, sum(pi.copies) AS sales
   FROM   PurchaseItem  pi 
   JOIN   Purchase      pu USING (purchase_id)
   GROUP  BY 1, 2
   )
SELECT p.name AS product_name, p.description AS product_description
     , a.stock_sum, b.purchase_sum
     , c.max_sales, o.name AS outlet_name
FROM   Product p
LEFT   JOIN (
   SELECT product_id, SUM(copies) AS stock_sum
   FROM   Stock
   GROUP  BY 1
   ) a USING (product_id)
LEFT   JOIN (
   SELECT product_id, sum(sales) AS purchase_sum
   FROM   ct
   GROUP  BY 1
   ) b USING (product_id)
LEFT   JOIN (
   SELECT DISTINCT ON (product_id)
          product_id, outlet_id, sales AS max_sales
   FROM   ct
   ORDER  BY product_id, sales DESC
   ) c  USING (product_id)
LEFT   JOIN Outlet o USING (outlet_id);

Test performance with EXPLAIN ANALYZE (a couple of times to exclude caching effects).

Related Solutions

PostgreSQL query with max and min date plus associated id per row

You can calculate this in one step with window functions:

CREATE OR REPLACE VIEW daily_trans AS
SELECT DISTINCT
       trans_date
     , first_value(trans_time) OVER w AS first_time
     , first_value(id)         OVER w AS first_id
     , last_value(trans_time)  OVER w AS last_time
     , last_value(id)          OVER w AS last_id
     , calculate_status(min(trans_time) OVER w
                      , max(trans_time) OVER w) AS status 
FROM   trans
WINDOW w AS (PARTITION BY trans_date ORDER BY trans_time, id
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY 1;

I added id as secondary ORDER column in the window clause, to make the sort order stable in case of identical times per day.

If you are not familiar with window functions, be sure to read the chapter on window functions in the PostgreSQL manual.

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is a less commonly used clause that adjusts the window frame. It is needed here because we want last_value() to always return the last value of the whole partition (the date), not the last value up to the current row (including its peers) according to the sort order.
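The effect of the frame clause is easiest to see side by side. The sketch below is an illustration under assumptions, not the original poster's data: the trans table holds three invented rows for a single day, and SQLite (via Python's sqlite3 module) stands in for PostgreSQL because both use the same default frame, which ends at the current row and its peers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trans (id INTEGER PRIMARY KEY, trans_date TEXT, trans_time TEXT);
-- hypothetical sample data: three transactions on one day
INSERT INTO trans VALUES
   (1, '2024-01-01', '09:00'),
   (2, '2024-01-01', '12:30'),
   (3, '2024-01-01', '17:45');
""")

rows = con.execute("""
SELECT id
       -- default frame: from partition start up to the current row
     , last_value(id) OVER (PARTITION BY trans_date
                            ORDER BY trans_time, id) AS default_frame
       -- explicit frame: the whole partition
     , last_value(id) OVER (PARTITION BY trans_date
                            ORDER BY trans_time, id
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND UNBOUNDED FOLLOWING) AS full_frame
FROM   trans
ORDER  BY id
""").fetchall()

print(rows)
```

With the default frame, last_value(id) simply returns the current row's id (1, 2, 3), while the explicit frame returns 3 for every row of the day.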

How to combine DISTINCT with window functions is explained in "PostgreSQL: running count of rows for a query 'by minute'".

SQL Server – Returning Only the Row with Maximum Value from Related Rows

You can use window functions:

select id, reservationID, swapNumber, oldRoom, newRoom
from (
    select id, reservationID, swapNumber, oldRoom, newRoom
         , row_number() over (partition by reservationId
                              order by swapNumber desc) as rn
    from @RoomExchange
) as T
where rn = 1;

By reversing the sort order we can pick the first one for each partition.
