Mysql – Trying to reduce number of table scans when querying on the same column but different values

MySQLperformancequery-performance

I'm having a mental block. I'm using a sample MySQL customer database…

http://www.mysqltutorial.org/mysql-sample-database.aspx

And would like to query for any customers that bought two products (product codes S18_1589 and S24_1628). I wrote this query…

SELECT customerNumber
FROM orders o
WHERE customerNumber IN (SELECT customerNumber
                         FROM orders o
                         JOIN orderdetails od
                         ON o.orderNumber = od.orderNumber
                         WHERE od.productCode = 'S18_1589')
AND customerNumber IN (SELECT customerNumber
                         FROM orders o
                         JOIN orderdetails od
                         ON o.orderNumber = od.orderNumber
                         WHERE od.productCode = 'S24_1628')
GROUP BY customerNumber
ORDER BY customerNumber;

That gets the results that I want but I'd like to do this without using two sub queries if possible. The customers can buy any other products but they need to have purchased the two mentioned.

In order to try to make the query more efficient, I changed it to this…

SELECT customerNumber
FROM orders o
JOIN orderdetails od
ON o.orderNumber = od.orderNumber
WHERE od.productCode IN ('S24_1628', 'S18_1589')
GROUP BY customerNumber
HAVING COUNT(*) >= 2
ORDER BY customerNumber;

which gives me the same result as the first query but I think it's just a coincidence. I believe the query will also include customers that have purchased only one of the products but 2 or more times.

Anyone know a better way / more efficient way of handling?

NOTE: I was given this question and was told my query was not performant because of the two subqueries but no further feedback was given. It's a different dataset / schema than what I'm using now but size is similar and what I'm trying to query is the same type of question.

Best Answer

Try the following:

SELECT o1.customerNumber
FROM orders o1
INNER JOIN orderdetails od1
ON o1.orderNumber = od1.orderNumber
INNER JOIN orders o2
ON o2.customerNumber = o1.customerNumber
INNER JOIN orderdetails od2
ON o2.orderNumber = od2.orderNumber
WHERE od1.productCode = 'S18_1589'
AND od2.productCode = 'S24_1628'
GROUP BY o1.customerNumber;

Describing the logic in words, this will find all orders for the first product number, then require there also be orders for the same customerNumber that has ordered the second product number. (This logic is NOT meant to imply the order the optimizer will choose joins. It is only the logic used to construct the query, based on the requirements stated.)

I dropped the ORDER BY because GROUP BY includes an order by as part of what it does. Although, the optimizer likely would already know it is ordered, and so it is just syntax difference.

As for whether this is more efficient, I would benchmark it to see. For larger data sets I believe my solution will be better than an OR based one, but again, always benchmark.

Related Solutions

Mysql – Reducing execution time for this gnarly SQL

At the moment, I am using an instance of MySQL 5.6.19 which has the performance_schema enabled on a 2GB x64 Linux laptop (nothing stellar by any means!). My normal response time is approx. 0.5 s - not too bad. What's your setup (RAM, disk config, OS &c.)?

I must say that I'm surprised at 3 seconds on a server. Are you sure that it's not some sort of network issue?

I checked out I/O for the query and noticed something strange - every time the query executes, the order_items table is accessed 27,321 times - which is exactly the same as the number of records. This smells of a correlated subquery - but the SQL is horribly complex and I haven't been able to get to the bottom of it (not enough time). Obviously, you should only want to access the table once (or a few times maybe) at most. Order_status is only read once for example.

It's not really a surprise that your query doesn't use indexes - see this post here which gives some explanation as to why - basically if the optimiser thinks that a FTS (full table scan) is cheaper than using the index, then that's what it will do. This is a function of the size of the table(s) (yours are v. small ===> FTS) and/or percentage of records to be retrieved (yours 100%, again ==> FTS).

I notice that you are using MyISAM tables, for full text indexes? InnoDB now has full text indexes - see here and check the docco. Also (and more importantly), InnoDB allows you to use DRI (declarative referential integrity) which is very important - IMHO, there is no good reason to use MyISAM tables anymore.

Could you explain the business logic behind this query? It strikes me (and this is only an opinion) that it is overly complex for a relatively simple system with 8 tables - has it grown over time and been added to by different people? I am respectfully suggesting that the time may have come for a complete rewrite from scratch rather than tinkering at the edges.

Another problem is that for orders, you have ordered_date as an INT(12) - when I first saw the data, I thought that it was a timestamp. You are creating extra work for yourself if you don't use the correct data type for your columns. Error checking is more difficult.

[EDIT] in response to OP's comments (easy to hard)

INT v TIMESTAMP v DATE/DATETIME

Take a look at these (1, 2, 3, 4) posts. I suppose that INT v TIMESTAMP is relatively trivial (however, see point 4 in the answer here), but INT/TIMESTAMP v DATE/DATETIME isn't. Check out the answer to post 4 - you should be using DATEs/DATETIMEs when you are using INT/TIMESTAMP.

MySQL 5.6.19, but the server is using 5.5.37. Would upgrading improve the performance?

Maybe, maybe not - it's probably trivial in respect to the issues that I'll be going through in point 3 - see InnoDB/Full Text Indexing. If I had to guess, upgrading would decrease performance by no more than a couple of percent (some think that the performance schema adds overhead) and quite possibly be a significant improvement. These things can vary according to many parameters (disk, CPU, RAM, network, server load, phase of the moon...). Overall though, performance should improve release by release although YMMV.

General remarks.

I changed the storage engine for your system from MyISAM to InnoDB. This resulted in the query time being consistently reduced from ~ .45s to ~ .37s (~20%) - a noticeable improvement - maybe you should try it? I can only speak for the testing (not v. thorough) I did on my 2GB x64 Linux laptop. You didn't answer my question about your own server specs. If I'm getting a response time of ~ .4s, I can't for the life of me understand why you're getting 3s? Unless it's (as I suggested) a network issue, or is the server under load when you run the query?

InnoDB now has full text indexing in 5.6 (see my point 2 above) - AFAIK, that will not work in 5.5 (i.e. your table declarations will fail for InnoDB) - another reason to upgrade.

I tried to put in FOREIGN KEY relationships in the tables, but there appears to be problems with your data - I'm getting different counts after reloading the schemas (without FK and with FK). This can mean you have orphaned records, which may or may not affect the result. Anomalies occurred with orders/order_items and customers/customer_addresses. e.g.

mysql> select i.item_id, i.order_id from order_items i where i.order_id not in (SELECT ORDER_ID from orders);
+---------+----------+
| item_id | order_id |
+---------+----------+
|   26892 |     2433 |
+---------+----------+

Worse, none of the order payments appear valid

mysql> select p.order_id from order_payments p where p.order_id not in (select order_id from orders);
+----------+
| order_id |
+----------+
|     2143 |
|     2214 |
|     2216 |
+----------+
3 rows in set (0.00 sec)

mysql> 
mysql> select * from order_payments;
+------------+----------+--------+------------+------------+
| payment_id | order_id | amount | date       | is_deleted |
+------------+----------+--------+------------+------------+
|          1 |     2143 |  99.34 | 1405900800 |          0 |
|          2 |     2216 |   5.09 | 1406505600 |          0 |
|          3 |     2214 |  55.10 | 1406505600 |          0 |
+------------+----------+--------+------------+------------+

I couldn't figure out the other relationships in the time available - maybe we can go over them. It may well be that a solid DB structure with overtly declared FOREIGN KEYs may perform better. As a general rule, the more information you give the optimiser, the better.

The point about the optimiser not using indexes when tables are too small and/or too high a percentage of records is accessed still stands - however, there is no harm in doing things correctly - and as your data volumes change/grow, the optimiser may choose different plans - and the more info, the better.

The orphaned record issue (forgetting about performance for the moment) is another GOOD reason to switch to InnoDB. With proper DRI (declarative referential integrity), you cannot have orphaned records, which can be the source of tricky to find bugs.

Finally, just a few words on the system generally. It seems to me that this query is like a runaway train - it has built up and built up momentum and is now overly complex for purpose. Maybe a step back is required? I got to thinking about triggers and using them to populate other tables (similar to a materialised view) and/or even the use of views themselves. You need to have a system that you are comfortable with and that you can change to suit your needs. Some behemoth sitting in the corner spewing out answers that nobody understands is not very conducive to flexibility, upgrading or peace of mind :-).

I would like to do three things (using InnoDB)

1) get all the data types in order (DATE/DATETIMEs when appropriate)

2) sort out the foreign key relationships (need input on that - please feel free to contact me off list for this - my email is in my profile).

3) break this juggernaut down into its constituent components and explore methods of making it better, faster, easier to understand and modify.

I am unable to spend 09:00 - 18:00 at this, but would like to have a look in the evening if that suits and you are interested.

MySQL – SUM with Multiple GROUP BY on Same Table

I'm more of a SQL Server guy, but here is a stab to get you motivated. You could use the CASE expression to help SUM() orders by certain order types.

For example:

SELECT
    idclient AS [Client ID]
   ,COUNT(value) AS [Total Orders]
   ,SUM(value) AS [Total Order Value]
   ,SUM( CASE WHEN type1 = 'some type' THEN value ELSE 0 END) AS [Order Combo 1]
   ,SUM( CASE WHEN type1 = 'some type' AND type2 = 'another type' THEN value ELSE 0 END) [Order Combo 2]
   ,SUM( CASE WHEN type1 = 'some type' AND type2 = 'another type' AND type3 = 'another type' THEN value ELSE 0 END) [Order Combo 3]
FROM 'test'.'orders'
GROUP BY idclient

This will group the orders on [idclient] and will SUM([value]) the instances where certain order records meet the following conditions or combinations you specified.

So, if a customer has an order where [type1] was Apples and [type2] was Oranges, you can SUM([value]) that was 1 under a new column called something like [Apple/Orange Orders].

The result can also include additional combinations for the same record in multiple fields/columns. If that is a problem, just move the CASE to the GROUP BY to break each combination into a separate record in the the result set.

Hope this helps!

Best Answer

Related Solutions

Mysql – Reducing execution time for this gnarly SQL

MySQL – SUM with Multiple GROUP BY on Same Table

Related Question