SQL Join – Joining Products and Orders with Quantity Ordered by Buyer


Table A is a list of "products" and table B is a list of "orders" referencing products (via product_id), along with buyer information (buyer_name, etc.) and the quantity of each product bought.

An order has many products and a single buyer, so that when a buyer issues an order, a row with the same buyer_name and order_no is inserted into table B for each product ordered along with the desired quantity.
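For reference, here is a minimal sketch of the schema (only product_id, order_no, buyer_name, and quantity come from the description above; product_name and the key/type choices are illustrative):

-- Minimal illustrative schema; names and types beyond those in the
-- description above are assumptions.
CREATE TABLE products (
    product_id   integer PRIMARY KEY,
    product_name text
);

CREATE TABLE orders (
    order_no   integer,
    buyer_name text,
    product_id integer REFERENCES products (product_id),
    quantity   integer,
    PRIMARY KEY (order_no, product_id)  -- one line per product per order
);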

What are the two fastest queries to join "products" and "orders" so that for every "products" row I also get the quantity that a specified buyer has ordered:

  • in a single order (order_no is specified)
  • in all his orders (sum quantity in every order issued by the buyer)

To be efficient, all the "orders" rows not belonging to the specified buyer (and, for the first query, not having the specified order_no) should be excluded before the join, but every row from "products" should be returned, even if the product has not been ordered by the buyer (quantity should be 0 in this case).

UPDATE (from comments):

Current query:

SELECT p.*, o.buyer_name, coalesce(o.quantity,0)
  FROM products p
         LEFT JOIN (SELECT *
                      FROM orders
                     WHERE buyer_name = 'Joe'
                       AND order_no = 123
                   ) AS o USING (product_id)

I'm not having a performance issue now, and I'm not looking to tightly tune this specific query, but to ensure I'm writing queries efficiently in general. Likewise, the current DBMS is PostgreSQL 9.4.4, but I'm looking for generic advice that would apply to most SQL variants.

Best Answer

In general, it's best to leave the specifics of how the data is joined together to the DB engine - usually, there is an optimizer that looks at table/index statistics, and figures out the most efficient way to eliminate rows from consideration. If you're having issues with performance on a specific query, you can sometimes force the engine to use a particular method/option on a given query - but this can lead to good performance in specific cases and bad performance in everything else.
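For example, in PostgreSQL (your DBMS) you can nudge the planner for a single session by toggling its planner settings. This is a sketch of what that kind of override looks like, not a recommendation:

-- PostgreSQL-specific sketch: discourage nested-loop joins for this
-- session only, inspect the resulting plan, then restore the default.
SET enable_nestloop = off;

EXPLAIN ANALYZE
SELECT p.product_name, o.buyer_name, COALESCE(o.quantity, 0) AS Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               );

RESET enable_nestloop;  -- put the planner back to normal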

That said, the logic of your query is such that rows from your "right" table should be excluded from consideration if they don't belong to the targeted buyer/order, so those conditions do need to be part of the LEFT JOIN's ON clause; putting them in the WHERE clause instead would filter out the unmatched products you want to keep. I would rework that as follows:

  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )

If you simply join the two tables, the DB engine will figure out what columns it needs out of those tables from the rest of the query. I can't think of a good reason to use a sub-query that just selects specific columns from a table in a join, rather than using the table itself directly. I won't say there isn't such a reason, just that I have never encountered one.

Some versions of SQL may materialize a subquery - basically, running the subquery before the main query is evaluated, then evaluating the main query from those results. If you do choose to use a subquery, I'd recommend not using SELECT *, but rather specifying just the columns the main query actually needs (SELECT product_id, buyer_name, quantity FROM orders...), to avoid pulling in data you aren't going to use.
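Applied to your subquery, that would look something like this (buyer_name stays in the list because the outer query selects it):

-- Same subquery as before, but listing only the columns the outer
-- query actually uses, instead of SELECT *.
SELECT p.*, o.buyer_name, COALESCE(o.quantity, 0)
  FROM products p
         LEFT JOIN (SELECT product_id, buyer_name, quantity
                      FROM orders
                     WHERE buyer_name = 'Joe'
                       AND order_no = 123
                   ) AS o USING (product_id)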

That same advice holds for the main query; rather than using SELECT *, specify the actual columns you need. As noted before, the DB engine will figure out what columns it needs from each table by looking at the main query's SELECT list, the WHERE clause, the JOIN criteria, and so forth. If you specify more columns in the SELECT list than you really need, the engine will give them to you. If each row in your table has two varchar fields averaging 250 characters each, but all you really need is a varchar(25) field and three integers, you'll be returning a result set around 10 times as large as it needs to be (and it will take about ten times as long to reach the machine that made the request).

Limiting your columns to those you actually need will also allow the DB engine's optimizer to apply certain shortcuts that can improve performance. For example, if your table's rows are (on average) 1000 bytes wide, but you have an index that includes just the columns you need, and those columns only average 100 bytes per row, the optimizer can use that index as if it were the table, and may thus reduce the number of database pages it has to read to get your data by a factor of 10. The term for this is a covering index.
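As a sketch, a covering index for the lookups in this answer might look like the following (the column list is an assumption based on these queries; PostgreSQL 9.4 has no INCLUDE clause, so every covered column goes into the index key):

-- Hypothetical covering index: the engine can answer the join and the
-- filters from the index alone, without reading the table's rows.
CREATE INDEX orders_buyer_order_product_idx
    ON orders (buyer_name, order_no, product_id, quantity);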

It's also a good idea to assign column names to computed columns - it makes it easier to refer to them elsewhere.

So, we'd end up with the following queries (note that I'm assuming you'll only be using a column named product_name in your results; if other columns from products apply, add them to the SELECT list and, in the second query, to the GROUP BY clause):

For a specific order:

SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(o.quantity,0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )

For a buyer's total activity:

SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                               )
 GROUP BY p.product_name, o.buyer_name

A few final notes:

In some SQL variants, every column in the SELECT list that is not inside an aggregate function must appear in the GROUP BY clause. Other versions allow you to leave columns out - the value from a "random" row will be returned. If the GROUP BY includes a value that is unique within its table, it is safe to leave that table's other columns out when using one of these other versions, as all the rows being aggregated would have to hold the same values.
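For example, assuming product_id is the primary key of products (the question doesn't name a key), PostgreSQL 9.1+ accepts the following, because product_name is functionally dependent on the grouped key:

SELECT p.product_id
      ,p.product_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                               )
 GROUP BY p.product_id  -- product_name need not be listed here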

Also, your query makes an assumption: any order will only have one line item for a given product. That seems like a reasonable assumption on the surface; however, I have seen systems where someone might place an order for 10 widgets on the 1st of the month, and add 10 more to the same order on the 15th. If there's any chance that a product might show up twice in the same order, you can use the second query to pull the order-specific information as well; you just need to put AND o.order_no = 123 back in, like in the first query (see the sketch below). This would probably take slightly longer than the query without the SUM and the GROUP BY - but I would expect, in most cases, the difference would be too small for a human being to tell.
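Concretely, that safer variant is just the second query with the order filter restored:

-- Second query with the order filter put back in: safe even if a
-- product appears on more than one line of the same order.
SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )
 GROUP BY p.product_name, o.buyer_name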