Mysql – Trying to reduce number of table scans when querying on the same column but different values

MySQLperformancequery-performance

I'm having a mental block. I'm using a sample MySQL customer database…

http://www.mysqltutorial.org/mysql-sample-database.aspx

And would like to query for any customers that bought two products (product codes S18_1589 and S24_1628). I wrote this query…

SELECT customerNumber
FROM orders o
WHERE customerNumber IN (SELECT customerNumber
                         FROM orders o
                         JOIN orderdetails od
                         ON o.orderNumber = od.orderNumber
                         WHERE od.productCode = 'S18_1589')
AND customerNumber IN (SELECT customerNumber
                         FROM orders o
                         JOIN orderdetails od
                         ON o.orderNumber = od.orderNumber
                         WHERE od.productCode = 'S24_1628')
GROUP BY customerNumber
ORDER BY customerNumber;

That gets the results that I want but I'd like to do this without using two sub queries if possible. The customers can buy any other products but they need to have purchased the two mentioned.

In order to try to make the query more efficient, I changed it to this…

SELECT customerNumber
FROM orders o
JOIN orderdetails od
ON o.orderNumber = od.orderNumber
WHERE od.productCode IN ('S24_1628', 'S18_1589')
GROUP BY customerNumber
HAVING COUNT(*) >= 2
ORDER BY customerNumber;

which gives me the same result as the first query but I think it's just a coincidence. I believe the query will also include customers that have purchased only one of the products but 2 or more times.

Anyone know a better way / more efficient way of handling?

NOTE: I was given this question and was told my query was not performant because of the two subqueries but no further feedback was given. It's a different dataset / schema than what I'm using now but size is similar and what I'm trying to query is the same type of question.

Best Answer

Try the following:

SELECT o1.customerNumber
FROM orders o1
INNER JOIN orderdetails od1
ON o1.orderNumber = od1.orderNumber
INNER JOIN orders o2
ON o2.customerNumber = o1.customerNumber
INNER JOIN orderdetails od2
ON o2.orderNumber = od2.orderNumber
WHERE od1.productCode = 'S18_1589'
AND od2.productCode = 'S24_1628'
GROUP BY o1.customerNumber;

Describing the logic in words, this will find all orders for the first product number, then require there also be orders for the same customerNumber that has ordered the second product number. (This logic is NOT meant to imply the order the optimizer will choose joins. It is only the logic used to construct the query, based on the requirements stated.)

I dropped the ORDER BY because GROUP BY includes an order by as part of what it does. Although, the optimizer likely would already know it is ordered, and so it is just syntax difference.

As for whether this is more efficient, I would benchmark it to see. For larger data sets I believe my solution will be better than an OR based one, but again, always benchmark.