Mysql – Why is a query with a join and limit clause running so slow, even using a covering index

indexMySQLoptimization

I've got two tables that are a 1 to 1 relation.

Table A has an id column, and table B includes foreign key A_id to table A, and a nullable value column. I have a composite index on [value, A_id] on table B, and I'm running the following query:

SELECT B.A_id FROM B JOIN A ON B.A_id = A.id WHERE B.value IS NULL LIMIT 1000;

First of all, I understand the JOIN doesn't seem to serve a purpose here, I'm just illustrating the problem. In reality I need to select one other column from table A. If I get rid of the join, the query is instant.

There are 10 million rows in table B. This query takes more than 50 seconds to run. Yet the explain plan shows that table B is being accessed first, and is using the composite index. "Extra" shows "Using where; Using index". "Rows" shows almost 5 million rows.

What's going on here, and why doesn't the LIMIT cause this to run as nearly as quickly as without the join?

Best Answer

Even if both tables have the appropriate indexes on A.id and A_id you still may need an index on B.value to get decent performance. I've reformatted your query for readability.

SELECT  B.A_id 
FROM    B 
JOIN    A 
ON      B.A_id = A.id 
WHERE   B.value IS NULL 
LIMIT   1000;

It appears that A_id is a foreign key for a many to one relationship from B to A. If that is the case we don't need to reference table A as A_id will be NOT NULL if the tables would join.

SELECT  B.A_id 
FROM    B 
WHERE   B.value IS NULL 
AND     A_id IS NOT NULL
LIMIT   1000;

If B.value is rarely NULL you may not 1000 records in B that where B.value is NULL. You can count the records, but not get a distribution of the records with the query.

SELECT  COUNT(1) 
FROM    B 
WHERE   B.value IS NULL;

An index on B.value would likely speed up the query. A limit of 1000 is significantly larger than I have used on a production system. Try the query with smaller limits like 1, 2, 5, 10, 20, 50 or 100.

Problems with query performance are best investigated with an EXPLAIN PLAN of the query. This will tell you how the database is executing the query. I've seen the optimizer choose rather strange plans, usually providing better performance than the plan I expected. Rarely, the optimizer will choose a really bad plan. Without the right indexes and statistics, the optimizer may not be able to generate a plan the performs well.