TheSQL efficiency with inner join queries

MySQLoptimization

I have two tables, both with about 1,000,000 rows.

Here is the question I am trying to get mySQL to solve:

  • return all record_ids from table_a where that id matches the cat_id in table_b, but only for customer_id 654 (customer_id exists in both tables).

First off I tried this with the following statement:

SELECT record_id from `table_a` where customer_id="654" and record_id in
(SELECT cat_id from `table_b` where cat_id="654");

This is the most logical looking one to me, as it's more at my coding level. But this query took 2 minutes to run. So I tried to re-write it using:

SELECT record_id from `table_a` where 
customer_id = "654" and exists (SELECT cat_id FROM `table_b` where
table_b.customer_id = "654" and
table_a.record_id= table_b.cat_id 
group by record_id having count(*) > 1);

This took 22 seconds to run.

I tried another way as well:

SELECT table_a.record_id from table_a
INNER JOIN table_b ON (table_a.record_id = table_b.cat_id)
WHERE table_a.customer_id = "654" and table_b.customer_id= "654"
GROUP BY table_a.record_id
HAVING COUNT(*) > 1;

This one takes 19.5 seconds to run.

I'm wondering if my third attempt is the most efficient way to ask this question, or as I'm hoping, there is a more efficient way to run these.

Here is mySQL's execution path for the last two queries:

enter image description here

On table_a, I have the following indices:

KEY `rc` (`record_id`,`customer_id`)
KEY `test` (`record_id`,`customer_id`),  
KEY `test2` (`record_id`,`customer_id`),  
KEY `cust` (`customer_id`)

On table_b, I have the following indices:

KEY `rc` (`cat_id, `customer_id`)

There are duplicates that I made on accident and I haven't gotten around to deleting them, (not sure if that impacts performance):

Best Answer

SELECT record_id from `table_a`
where customer_id="654"
and record_id in
    (SELECT cat_id from `table_b` where cat_id="654");

The meaning of a JOIN is the AND of the meanings of its arguments. ON and WHERE both AND in a conditon. You want rows where (using obvious aliases):
    customer [a.customer_id] ...
AND customer [b.cat_id] ...
AND [a.customer_id] = 654 AND [b.cat_id] = 654 AND [a.record_id] = [b.cat_id]

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
WHERE a.customer_id = 654 AND b.cat_id = 654
AND a.record_id = b.cat_id

(In standard SQL ( INNER ) JOIN needs an ON. So you could CROSS JOIN or replace WHERE by ON.)

As a comment says, MySQL has historically not been very good at optimizing. But it is constantly improving. IN has been notoriously slow, even when it is equivalent to other more optimized expressions. You may get better performance by explicitly equating the ids first in an ON:

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
ON a.record_id = b.cat_id
AND a.customer_id = 654 AND b.cat_id = 654

Declare each of those fields as PK or UNIQUE NOT NULL if it is (which implicitly adds an index), otherwise add an index on it. MySQL unadorned KEY is a synonym for INDEX when not in a column declaration, which does not tell the database that a column set is unique. Yes, uniqueness affects performance, so give your tables their proper rows and declare and enforce any uniqueness by PRIMARY KEY or UNIQUE.

If you make the ids INT then the DBMS only needs to go to the index, not the data.

(Also, read the documentation re keys, indices and optimization. Use EXPLAIN.)