Sql-server – How to bring data from two tables that could/could not be present in one of them

group byjoin;sql server

Background: I have two tables, let's call them "purchaseTransactions" and "Products", the first one, holds all the transaction made by product, and the second one has all my products available, let's see an example oh how the look:

PurchaseTransactions

  • PurchaseDate
  • ProductId
  • Quantity
  • Some other fields…

Products

  • ProductName
  • ProductNumber
  • Some other fields…

Basically what I want to obtain is the quantity sold by month and product from a specific timelapse, For instance:

Let's say we have the Products

  • Name: Product A, No: 01
  • Name: Product B, No: 02
  • Name: Product C, No: 03

For Purchase Orders

  • PurchaseDate: Jan/01/2018, ProductId: Product A, Quantity: 100
  • PurchaseDate: Jan/18/2018, ProductId: Product A, Quantity: 150
  • PurchaseDate: Feb/02/2018, ProductId: Product A, Quantity: 150
  • PurchaseDate: Feb/24/2018, ProductId: Product A, Quantity: 200
  • PurchaseDate: April/04/2018, ProductId: Product A, Quantity: 200

  • PurchaseDate: January/01/2018, ProductId: Product B, Quantity: 222

  • PurchaseDate: March/02/2018, ProductId: Product B, Quantity 250
  • PurchaseDate: May/04/2018, ProductId: Product B, Quantity: 333
  • PurchaseDate: May/20/2018, ProductId: Product B, Quantity: 666

An example of what I want is to bring ALL the products sold grouped by Year,Month, Product and Sum of its quantity. Something like this.

 Year Month, ProductId, QuantitySold
 -----------------------------------
 2018 Jan   Product A   250
 2018 Jan   Product B   222
 2018 Jan   Product C   0
 2018 Feb   Product A   250
 2018 Feb   Product B   0
 2018 Feb   Product C   0
 2018 March Product A   0
 2018 March Product B   0
 2018 March Product C   250
 .... etc

Notice that the query should return even the products that does not have any purchases in the specified month, this is what I have tried so far:

SELECT YEAR(PurchaseDate), MONTH(PurchaseDate), ProductId, SUM(Quantity)
FROM PurchaseOrders
WHERE Product IN (Product A, Product B...) AND MONTH(PurchaseDate) => DATEADD(MONTH, -3, GETDATE()) --Could be by some specific products and for a specific timelapse, ex the past three months.
GROUP BY YEAR(PurchaseDate), MONTH(PurchaseDate), ProductId

But this only returns the Products existing in the Purchase table, I have tried with a LEFT JOIN with the Products table to get all the products, but I think I am not implementing it correctly.

Any Idea of how to do this?

Best Answer

You typically use a LEFT JOIN when you want all rows from the left table in a join. Example:

SELECT YEAR(o.PurchaseDate), MONTH(o.PurchaseDate), p.ProductId
     , SUM(o.Quantity)
FROM Products p
LEFT JOIN PurchaseOrders o
    ON p.ProductId = o.ProductId
WHERE Product IN ...
  AND MONTH(o.PurchaseDate) =>  DATEADD(MONTH, -3, GETDATE())
GROUP BY YEAR(o.PurchaseDate), MONTH(o.PurchaseDate), p.ProductId
ORDER BY ...

What happens here is that you get NULL in columns from the right when there is no match. To fix this we can add a year- and a month- table that contains the domain for these. I'm using a CTE here, but it can be a materialized table as well.

WITH Y(x) as ( values (2017),(2018),...) -- years of interest
     M(x) as ( values ('Jan'),('Feb'),...) -- month
SELECT Y.x, M.x, p.ProductId, sum(o.Quantity)
FROM Y
CROSS JOIN M
CROSS JOIN Products p
LEFT JOIN PurchaseOrders o
    ON p.ProductId = o.ProductId
   AND y.x = YEAR(o.PurchaseDate)
   AND m.x = MONTH(o.PurchaseDate) 
WHERE ...  
GROUP BY ...
ORDER BY ...

CROSS JOIN is the cartesian product between 2 tables, it is the same as a JOIN where the join predicate is always true:

A JOIN B ON true

Here:

FROM Y
CROSS JOIN M
CROSS JOIN Products p

becomes a new table with all years, months and products. When we LEFT JOIN this table with PurchaseOrders, we get a table with all years, months and products together with the Quantity where theres been a sale.

You will still get a null for the sum if there is no purchase, but that can be fixed with COALESCE. Coalesce is a function that returns it's left-most argument that is not null:

WITH Y(x) as ( values (2017),(2018),...) -- years of interest
     M(x) as ( values ('Jan'),('Feb'),...) -- month
SELECT Y.x, M.x, p.ProductId, COALESCE(sum(o.Quantity), 0) as Q...
FROM Y
CROSS JOIN M
CROSS JOIN Products p
LEFT JOIN PurchaseOrders o
    ON p.ProductId = o.ProductId
   AND y.x = YEAR(o.PurchaseDate)
   AND m.x = MONTH(o.PurchaseDate) 
WHERE ...  
GROUP BY ...
ORDER BY ...

You may also have a look at GROUPING SETS in case you want to roll up the summary for different dimensions. Example for total orders of each product, total orders per month, ...

GROUP BY GROUPING SETS ((Y.x, M.x, p.ProductId) -- current grouping
                       ,(Y.x, M.x)              -- all product per month
                       ,(p.ProductId)           -- total per product
                       ,()                      -- total 
                       )

You can use COALESCE to map null for the dimensions to a 'Total':

WITH Y(x) as ( values (2017),(2018),...) -- years of interest
     M(x) as ( values ('Jan'),('Feb'),...) -- month
SELECT Y.x  -- this is an int, leave as null
     , COALESCE(M.x, 'Total')
     , COALESCE(p.ProductId, 'Total')
     , COALESCE(sum(o.Quantity), 0) as Q...
...
GROUP BY GROUPING SETS ((Y.x, M.x, p.ProductId) -- current grouping
                       ,(Y.x, M.x)              -- all product per month
                       ,(p.ProductId)           -- total per product
                       ,()                      -- total 
                       )