MySQL uses INDEX for optimizing ORDER BY with GROUP BY except when you add a JOIN

Tags: index; join; mysql-5.7; order-by

Question

Why can't Query #2 use the same (car_trims.horsepower_peak) index to optimize the sorting of the rows as Query #1? The only difference between the two queries is the addition of the JOIN in Query #2.

car_trims ~50k rows

PK: (car_trims.id), index on (car_trims.horsepower_peak)

car_makes ~100 rows

PK: (car_makes.id)

Query #1

SELECT car_trims.*
FROM car_trims
GROUP BY car_trims.id
ORDER BY car_trims.horsepower_peak DESC
LIMIT 0, 200

Execution time: .0026 seconds

EXPLAIN: (output not preserved)

Query #2

SELECT car_trims.*
FROM car_trims
STRAIGHT_JOIN car_makes ON car_makes.id = car_trims.make_id
GROUP BY car_trims.id
ORDER BY car_trims.horsepower_peak DESC
LIMIT 0, 200

Execution time: .2533 seconds

EXPLAIN: (output not preserved)

UPDATE:

I've been continuing to work on this and I believe the index is not being utilized in Query #2 because of the mixing of GROUP BY and ORDER BY. According to the MySQL Docs,

"In some cases, MySQL cannot use indexes to resolve the ORDER BY …
[for example, when] … the query has different ORDER BY and GROUP BY
expressions."

Query #1 also mixes GROUP BY and ORDER BY, so according to the docs the index should not be usable there either. I believe that restriction may not apply when the GROUP BY is ignored entirely, which seems to happen when only one table is selected from and the grouping is on its primary key.

Also, my actual original query is not quite as simple as the example provided here. The crucial difference is the use of GROUP_CONCAT in the SELECT, which requires the aforementioned GROUP BY to avoid collapsing all rows into a single group (i.e. getting a 1-row result). The solution to that issue is to use a DEPENDENT SUBQUERY, as discussed here: https://stackoverflow.com/questions/7381828/indexing-with-group-by-order-by-and-group-concat
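A minimal sketch of that dependent-subquery shape, for illustration only. The child table car_trim_colors and its columns trim_id and name are hypothetical and do not appear in the question. The GROUP_CONCAT moves into a correlated subquery, so the outer query needs no GROUP BY.

```sql
-- Hypothetical child table: car_trim_colors(trim_id, name).
-- The aggregate runs once per outer row, so the outer query has no GROUP BY
-- and stays free to read car_trims in horsepower_peak index order.
SELECT car_trims.*,
       (SELECT GROUP_CONCAT(c.name ORDER BY c.name SEPARATOR ', ')
          FROM car_trim_colors AS c
         WHERE c.trim_id = car_trims.id) AS color_names
FROM car_trims
ORDER BY car_trims.horsepower_peak DESC
LIMIT 0, 200;
```

With the LIMIT in place, the subquery should only need to run for the rows that are actually returned, but that is worth confirming with EXPLAIN on the real schema.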

Best Answer

Query 1: Since id is the PRIMARY KEY, it is unique. Hence the GROUP BY id does nothing. Remove it. This may make it run faster.

Query 2 does not use any columns other than id from car_makes. The only thing that the JOIN does is to verify that there is a row in car_makes for the make_id. You probably don't need that check, so get rid of car_makes in that query. That will simplify things. Note that currently there is a "filesort". Without car_makes, that step will probably go away.
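Taken together, the two suggestions reduce Query #2 to a plain ordered scan. The first statement below is a sketch that assumes every car_trims.make_id has a matching row in car_makes (a foreign key would guarantee this). The second keeps the existence check for the case where that cannot be assumed. Whether either plan avoids the filesort should be confirmed with EXPLAIN.

```sql
-- No GROUP BY (id is already unique) and no JOIN (no car_makes columns are selected).
SELECT car_trims.*
FROM car_trims
ORDER BY car_trims.horsepower_peak DESC
LIMIT 0, 200;

-- Variant that still requires a matching car_makes row,
-- expressed as a filter instead of a join.
SELECT car_trims.*
FROM car_trims
WHERE EXISTS (SELECT 1
                FROM car_makes
               WHERE car_makes.id = car_trims.make_id)
ORDER BY car_trims.horsepower_peak DESC
LIMIT 0, 200;
```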

As for "why can't it use the same index" -- The STRAIGHT_JOIN forces it to look at the other table first. This effectively turns the second table into

WHERE     make_id = ...
GROUP BY  id
ORDER BY  horsepower_peak DESC

To optimize a query of that shape, it must first filter on make_id.

Related Solutions

Mysql – Slow complex query with group/order

I can see a couple of things that should improve your query performance.

1. As you already found out, there is absolutely no need to join mentioncache. Using EXISTS seems more natural (or IN, as you did, but EXISTS may perform better).

2. DATE(m.indexed) BETWEEN "2012-09-16" AND "2012-10-16" can be rewritten as m.indexed BETWEEN "2012-09-16" AND "2012-10-16 23:59:59", so that MySQL can use an index on m.indexed.

3. urlinfluranks doesn't seem to be used anywhere except in the LEFT JOIN. Why do you need it?

4. f.foreign_id can be either NULL or m.id, and this is the only reference to the favoureditems table, so I'd rather use a subquery in this case.

Finally, I think you can get the same results without GROUP BY m.id (as far as I understand, mentions.id is a primary key):

SELECT   
m.id, m.title, m.title_text, m.content_text, m.url,m.root_url,m.sub_type,m.indexed,  
CASE 
 WHEN EXISTS 
    (SELECT NULL FROM favoureditems f WHERE f.model = "Mention" 
    AND f.foreign_id = m.id AND f.owner_id = 803) THEN m.id 
END AS foreign_id,  -- a column alias cannot carry a table qualifier
v.foreign_id, v.created, mfs.score,  
Image.id,Image.model,Image.foreign_key, Image.dirname,Image.basename,  
(REPLACE(REPLACE(m.host_url, 'http://www.', ''), 'http://', '')) AS Mention__plain_url  
FROM mentions AS m  

LEFT JOIN 
(
  SELECT id,model,foreign_key,dirname,basename 
  FROM attachments Image  
  WHERE model = 'Mention'
  GROUP BY foreign_key
 )Image  ON (Image.foreign_key = m.id)      

LEFT JOIN 
(
   SELECT v.foreign_id, v.created 
   FROM visiteditems AS v  
   WHERE (v.model = "Mention"  AND v.owner_id = 803)  
    GROUP BY v.foreign_id
)v ON (v.foreign_id = m.id)
LEFT JOIN 
(
   SELECT mention_id,score
   FROM mentionfeedscores mfs  
   WHERE mfs.feed_id = '474737584865424564398208323289092'
   GROUP BY mention_id
)mfs ON (mfs.mention_id = m.id )

WHERE m.indexed BETWEEN "2012-09-16" AND "2012-10-16 23:59:59"  
   AND EXISTS 
  (
     SELECT NULL FROM mentioncache mc  
      WHERE mc.mention_id = m.id AND mc.profile_id = 803  
   )    
ORDER BY m.indexed DESC  
LIMIT 10
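The date-range rewrite from point 2 can also be written as a half-open range, which avoids hard-coding 23:59:59 and still covers the whole of the last day if the column stores fractional seconds. This is a reduced sketch that keeps only the mentions table and assumes m.indexed is a DATETIME or TIMESTAMP column with an index on it.

```sql
-- Half-open range: includes every instant of 2012-10-16, excludes 2012-10-17 00:00:00.
-- The column is compared directly (no DATE() wrapper), so an index on m.indexed stays usable.
SELECT m.id, m.title, m.indexed
FROM mentions AS m
WHERE m.indexed >= '2012-09-16'
  AND m.indexed <  '2012-10-17'
ORDER BY m.indexed DESC
LIMIT 10;
```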

Mysql – Optimizing ORDER BY for simple MySQL query

The EXPLAIN SELECT you posted definitely seems counter-intuitive.

If your query included WHERE s.id = ... then the query plan you're seeing might make a little more sense, but I'm assuming it does not.

It looks like the optimizer is getting distracted by the facts that supplier is a smaller table and that the supplier_id index in the po table can be used as a covering index... and with those facts in hand, it's overlooking the seemingly obvious fact that the tables should be read in the opposite order from the one it chooses.

Here are two alternatives.

-- use the STRAIGHT_JOIN directive to insist that the optimizer process the tables in only the listed order:

SELECT STRAIGHT_JOIN * FROM `po` 
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;

-- use the FORCE KEY index hint to direct the optimizer to prefer the primary key of the po table:

SELECT * FROM `po` FORCE KEY (PRIMARY) 
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;

The first option is probably the better option, since FORCE KEY, in spite of the name, is still only a "hint" that the optimizer can choose to ignore, while STRAIGHT_JOIN genuinely does force the hand of the optimizer to join the tables in the order they're listed.
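Either alternative can be checked by prefixing it with EXPLAIN. The sketch below reuses the tables from the queries above. The second statement uses the FOR ORDER BY form of the index hint, which is part of MySQL's index hint syntax but is shown here only as a further variant to try, not as something the answer tested.

```sql
-- Expectation to verify: po is listed first in the plan and the Extra column
-- no longer shows "Using temporary; Using filesort".
EXPLAIN
SELECT STRAIGHT_JOIN * FROM `po`
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;

-- Variant: restrict the hint to the sort step only.
EXPLAIN
SELECT * FROM `po` FORCE INDEX FOR ORDER BY (PRIMARY)
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;
```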