TheSQL efficiency with inner join queries

MySQLoptimization

I have two tables, both with about 1,000,000 rows.

Here is the question I am trying to get mySQL to solve:

return all record_ids from table_a where that id matches the cat_id in table_b, but only for customer_id 654 (customer_id exists in both tables).

First off I tried this with the following statement:

SELECT record_id from `table_a` where customer_id="654" and record_id in
(SELECT cat_id from `table_b` where cat_id="654");

This is the most logical looking one to me, as it's more at my coding level. But this query took 2 minutes to run. So I tried to re-write it using:

SELECT record_id from `table_a` where 
customer_id = "654" and exists (SELECT cat_id FROM `table_b` where
table_b.customer_id = "654" and
table_a.record_id= table_b.cat_id 
group by record_id having count(*) > 1);

This took 22 seconds to run.

I tried another way as well:

SELECT table_a.record_id from table_a
INNER JOIN table_b ON (table_a.record_id = table_b.cat_id)
WHERE table_a.customer_id = "654" and table_b.customer_id= "654"
GROUP BY table_a.record_id
HAVING COUNT(*) > 1;

This one takes 19.5 seconds to run.

I'm wondering if my third attempt is the most efficient way to ask this question, or as I'm hoping, there is a more efficient way to run these.

Here is mySQL's execution path for the last two queries:

On table_a, I have the following indices:

KEY `rc` (`record_id`,`customer_id`)
KEY `test` (`record_id`,`customer_id`),  
KEY `test2` (`record_id`,`customer_id`),  
KEY `cust` (`customer_id`)

On table_b, I have the following indices:

KEY `rc` (`cat_id, `customer_id`)

There are duplicates that I made on accident and I haven't gotten around to deleting them, (not sure if that impacts performance):

Best Answer

SELECT record_id from `table_a`
where customer_id="654"
and record_id in
    (SELECT cat_id from `table_b` where cat_id="654");

The meaning of a JOIN is the AND of the meanings of its arguments. ON and WHERE both AND in a conditon. You want rows where (using obvious aliases):
customer [a.customer_id] ...
AND customer [b.cat_id] ...
AND [a.customer_id] = 654 AND [b.cat_id] = 654 AND [a.record_id] = [b.cat_id]

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
WHERE a.customer_id = 654 AND b.cat_id = 654
AND a.record_id = b.cat_id

(In standard SQL ( INNER ) JOIN needs an ON. So you could CROSS JOIN or replace WHERE by ON.)

As a comment says, MySQL has historically not been very good at optimizing. But it is constantly improving. IN has been notoriously slow, even when it is equivalent to other more optimized expressions. You may get better performance by explicitly equating the ids first in an ON:

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
ON a.record_id = b.cat_id
AND a.customer_id = 654 AND b.cat_id = 654

Declare each of those fields as PK or UNIQUE NOT NULL if it is (which implicitly adds an index), otherwise add an index on it. MySQL unadorned KEY is a synonym for INDEX when not in a column declaration, which does not tell the database that a column set is unique. Yes, uniqueness affects performance, so give your tables their proper rows and declare and enforce any uniqueness by PRIMARY KEY or UNIQUE.

If you make the ids INT then the DBMS only needs to go to the index, not the data.

(Also, read the documentation re keys, indices and optimization. Use EXPLAIN.)

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

I don't see a lot of opportunity for improvement.

The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.

But since you're using functions to construct the financial_year value for the group by, both the "using filesort" and "using temporary" can't be avoided. But, those aren't the real problem. The real problem is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... ~700,000 function calls in one second doesn't seem too bad.

Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.

That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.

MySQL server taking multiple seconds to run COUNT(*) query using an index on a MyISAM table

The problem appears to have been that the cardinality of my indices was NULL. I am unsure why the indices would revert to this state when they were apparently fine yesterday.

I discovered this problem by running SHOW INDEX FROM tickets.

I ran ANALYZE TABLE tickets. Now the EXPLAIN looks subtly different-- the rows have dropped to 1:

mysql> explain select count(*) from tickets WHERE id = 214;
|  1 | SIMPLE      | tickets | const | PRIMARY       | PRIMARY | 2       | const |    1 | Using index |

Best Answer

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

MySQL server taking multiple seconds to run COUNT(*) query using an index on a MyISAM table

Related Question