SQL Join – Joining Products and Orders with Quantity Ordered by Buyer


Table A is a list of "products" and table B is a list of "orders" referencing products (via product_id), along with buyer information (buyer_name, etc.) and the quantity of each product bought.

An order has many products and a single buyer, so that when a buyer issues an order, a row with the same buyer_name and order_no is inserted into table B for each product ordered along with the desired quantity.
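For reference, here is a minimal sketch of the schema (only product_id, order_no, buyer_name, and quantity come from the description above; product_name and the key/type choices are illustrative):

-- Minimal illustrative schema; names and types beyond those in the
-- description above are assumptions.
CREATE TABLE products (
    product_id   integer PRIMARY KEY,
    product_name text
);

CREATE TABLE orders (
    order_no   integer,
    buyer_name text,
    product_id integer REFERENCES products (product_id),
    quantity   integer,
    PRIMARY KEY (order_no, product_id)  -- one line per product per order
);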

What are the two fastest queries to join "products" and "orders" so that for every "products" row I also get the quantity that a specified buyer has ordered:

  • in a single order (order_no is specified)
  • in all his orders (sum quantity in every order issued by the buyer)

To be efficient, all the "orders" rows not belonging to the specified buyer (and, for the first query, not having the specified order_no) should be excluded before the join, but every row from "products" should be returned, even if the product has not been ordered by the buyer (quantity should be 0 in this case).

UPDATE (from comments):

Current query:

SELECT p.*, o.buyer_name, coalesce(o.quantity,0)
  FROM products p
         LEFT JOIN (SELECT *
                      FROM orders
                     WHERE buyer_name = 'Joe'
                       AND order_no = 123
                   ) AS o USING (product_id)

I'm not having a performance issue now, and I'm not looking to tightly tune this specific query, but to ensure I'm writing queries efficiently in general. Likewise, the current DBMS is PostgreSQL 9.4.4, but I'm looking for generic advice that would apply to most SQL variants.

Best Answer

In general, it's best to leave the specifics of how the data is joined together to the DB engine - usually, there is an optimizer that looks at table/index statistics, and figures out the most efficient way to eliminate rows from consideration. If you're having issues with performance on a specific query, you can sometimes force the engine to use a particular method/option on a given query - but this can lead to good performance in specific cases and bad performance in everything else.
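For example, in PostgreSQL (your DBMS) you can nudge the planner for a single session by toggling its planner settings. This is a sketch of what that kind of override looks like, not a recommendation:

-- PostgreSQL-specific sketch: discourage nested-loop joins for this
-- session only, inspect the resulting plan, then restore the default.
SET enable_nestloop = off;

EXPLAIN ANALYZE
SELECT p.product_name, o.buyer_name, COALESCE(o.quantity, 0) AS Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               );

RESET enable_nestloop;  -- put the planner back to normal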

That said, the logic of your query is such that rows from your "right" table should be excluded from consideration if they don't belong to the targeted buyer/order, so those conditions do need to be part of the LEFT JOIN's ON clause; putting them in the WHERE clause instead would filter out the unmatched products you want to keep. I would rework that as follows:

  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )

If you simply join the two tables, the DB engine will figure out what columns it needs out of those tables from the rest of the query. I can't think of a good reason to use a sub-query that just selects specific columns from a table in a join, rather than using the table itself directly. I won't say there isn't such a reason, just that I have never encountered one.

Some versions of SQL may materialize a subquery - basically, running the subquery before the main query is evaluated, then evaluating the main query from those results. If you do choose to use a subquery, I'd recommend not using SELECT *, but rather specifying just the columns the main query actually needs (SELECT product_id, buyer_name, quantity FROM orders...), to avoid pulling in data you aren't going to use.
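Applied to your subquery, that would look something like this (buyer_name stays in the list because the outer query selects it):

-- Same subquery as before, but listing only the columns the outer
-- query actually uses, instead of SELECT *.
SELECT p.*, o.buyer_name, COALESCE(o.quantity, 0)
  FROM products p
         LEFT JOIN (SELECT product_id, buyer_name, quantity
                      FROM orders
                     WHERE buyer_name = 'Joe'
                       AND order_no = 123
                   ) AS o USING (product_id)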

That same advice holds for the main query; rather than using SELECT *, specify the actual columns you need. As noted before, the DB engine will figure out what columns it needs from each table by looking at the main query's SELECT list, the WHERE clause, the JOIN criteria, and so forth. If you specify more columns in the SELECT list than you really need, the engine will give them to you. If each row in your table has two varchar fields averaging 250 characters each, but all you really need is a varchar(25) field and three integers, you'll be returning a result set around 10 times as large as it needs to be (and it will take about ten times as long to reach the machine that made the request).

Limiting your columns to those you actually need will also allow the DB engine's optimizer to apply certain shortcuts that can improve performance. For example, if your table's rows are (on average) 1000 bytes wide, but you have an index that includes just the columns you need, and those columns only average 100 bytes per row, the optimizer can use that index as if it were the table, and may thus reduce the number of database pages it has to read to get your data by a factor of 10. The term for this is a covering index.
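As a sketch, a covering index for the lookups in this answer might look like the following (the column list is an assumption based on these queries; PostgreSQL 9.4 has no INCLUDE clause, so every covered column goes into the index key):

-- Hypothetical covering index: the engine can answer the join and the
-- filters from the index alone, without reading the table's rows.
CREATE INDEX orders_buyer_order_product_idx
    ON orders (buyer_name, order_no, product_id, quantity);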

It's also a good idea to assign column names to computed columns - it makes it easier to refer to them elsewhere.

So, we'd end up with the following queries (note that I'm assuming you'll only be using a column named product_name in your results; if other columns from products apply, add them to the SELECT list and, in the second query, to the GROUP BY clause):

For a specific order:

SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(o.quantity,0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )

For a buyer's total activity:

SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                               )
 GROUP BY p.product_name, o.buyer_name

A few final notes:

In some SQL variants, every column in the SELECT list that is not inside an aggregate function must appear in the GROUP BY clause. Other versions allow you to leave columns out - the value from a "random" row will be returned. If the GROUP BY includes a value that is unique within its table, it is safe to leave that table's other columns out when using one of these other versions, as all the rows being aggregated would have to hold the same values.
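For example, assuming product_id is the primary key of products (the question doesn't name a key), PostgreSQL 9.1+ accepts the following, because product_name is functionally dependent on the grouped key:

SELECT p.product_id
      ,p.product_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                               )
 GROUP BY p.product_id  -- product_name need not be listed here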

Also, your query makes an assumption: any order will only have one line item for a given product. That seems like a reasonable assumption on the surface; however, I have seen systems where someone might place an order for 10 widgets on the 1st of the month, and add 10 more to the same order on the 15th. If there's any chance that a product might show up twice in the same order, you can use the second query to pull the order-specific information as well; you just need to put AND o.order_no = 123 back in, like in the first query (see the sketch below). This would probably take slightly longer than the query without the SUM and the GROUP BY - but I would expect, in most cases, the difference would be too small for a human being to tell.
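Concretely, that safer variant is just the second query with the order filter restored:

-- Second query with the order filter put back in: safe even if a
-- product appears on more than one line of the same order.
SELECT p.product_name
      ,o.buyer_name
      ,COALESCE(SUM(o.quantity),0) as Qty_Ordered
  FROM products p
         LEFT JOIN orders o ON (    p.product_id = o.product_id
                                AND o.buyer_name = 'Joe'
                                AND o.order_no = 123
                               )
 GROUP BY p.product_name, o.buyer_name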