I have two tables, both with about 1,000,000 rows.
Here is the question I am trying to get mySQL to solve:
- return all record_ids from table_a where that id matches the cat_id in table_b, but only for customer_id 654 (customer_id exists in both tables).
First off I tried this with the following statement:
SELECT record_id from `table_a` where customer_id="654" and record_id in
(SELECT cat_id from `table_b` where cat_id="654");
This is the most logical looking one to me, as it's more at my coding level. But this query took 2 minutes to run. So I tried to re-write it using:
SELECT record_id from `table_a` where
customer_id = "654" and exists (SELECT cat_id FROM `table_b` where
table_b.customer_id = "654" and
table_a.record_id= table_b.cat_id
group by record_id having count(*) > 1);
This took 22 seconds to run.
I tried another way as well:
SELECT table_a.record_id from table_a
INNER JOIN table_b ON (table_a.record_id = table_b.cat_id)
WHERE table_a.customer_id = "654" and table_b.customer_id= "654"
GROUP BY table_a.record_id
HAVING COUNT(*) > 1;
This one takes 19.5 seconds to run.
I'm wondering if my third attempt is the most efficient way to ask this question, or as I'm hoping, there is a more efficient way to run these.
Here is mySQL's execution path for the last two queries:
On table_a, I have the following indices:
KEY `rc` (`record_id`,`customer_id`)
KEY `test` (`record_id`,`customer_id`),
KEY `test2` (`record_id`,`customer_id`),
KEY `cust` (`customer_id`)
On table_b, I have the following indices:
KEY `rc` (`cat_id, `customer_id`)
There are duplicates that I made on accident and I haven't gotten around to deleting them, (not sure if that impacts performance):
Best Answer
The meaning of a
JOIN
is the AND of the meanings of its arguments.ON
andWHERE
both AND in a conditon. You want rows where (using obvious aliases):customer [a.customer_id] ...
AND customer [b.cat_id] ...
AND [a.customer_id] = 654 AND [b.cat_id] = 654 AND [a.record_id] = [b.cat_id]
(In standard SQL (
INNER
)JOIN
needs anON
. So you couldCROSS JOIN
or replaceWHERE
byON
.)As a comment says, MySQL has historically not been very good at optimizing. But it is constantly improving.
IN
has been notoriously slow, even when it is equivalent to other more optimized expressions. You may get better performance by explicitly equating theid
s first in anON
:Declare each of those fields as
PK
orUNIQUE NOT NULL
if it is (which implicitly adds an index), otherwise add an index on it. MySQL unadornedKEY
is a synonym forINDEX
when not in a column declaration, which does not tell the database that a column set is unique. Yes, uniqueness affects performance, so give your tables their proper rows and declare and enforce any uniqueness byPRIMARY KEY
orUNIQUE
.If you make the
id
sINT
then the DBMS only needs to go to the index, not the data.(Also, read the documentation re keys, indices and optimization. Use
EXPLAIN
.)