Mysql – Many-to-many tag filter query with a junction table get slower with more tags. Can it be faster

MySQLmysql-5.7performancequery-performanceself-join

Part 1

I have this query to find some videos containing all given tags. Here with tags 1,9,27,13,67 as an example.

SELECT *
FROM video v
WHERE v.id IN (
    SELECT tr1.vid as video_id FROM (SELECT video_id vid FROM tag_rel WHERE tag_id=1) tr1
    INNER JOIN (SELECT video_id vid FROM tag_rel WHERE tag_id=9) tr2 USING (vid)
    INNER JOIN (SELECT video_id vid FROM tag_rel WHERE tag_id=27) tr3 USING (vid)
    INNER JOIN (SELECT video_id vid FROM tag_rel WHERE tag_id=13) tr4 USING (vid)
    INNER JOIN (SELECT video_id vid FROM tag_rel WHERE tag_id=67) tr5 USING (vid)
    -- more joins if more tags -- 
)
AND is_x=0
AND is_y=0
AND deleted IS NULL
LIMIT 0, 60

It executes in ~1.6 seconds.

If I run the inner select alone, it executes in ~0.8 seconds, and the result is 77 video ids.

Now, if I replace the inner select with these (example) video ids:

SELECT *
FROM video v
WHERE v.id IN (
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77
)
AND is_x=0
AND is_y=0
AND deleted IS NULL
LIMIT 0, 60

This query executes in ~0.06 seconds.

Question 1: Why is this the case, and why doesn't the full query run in ~0.86 seconds, which is the sum of the two operations (get video ids by tags and get videos by video ids)?

Part 2

For my application, I can't inject the ids because the inner select might return millions of ids. So I need them in one query, together with the support of checking is_x, is_y and deleted, and offsetting the limit for pagination.

I have spent months going back and forth, and this is the fastest query for most cases.

Question 2: Can this whole query be done faster/better with other techniques?

Question 3: By adding more tags, the query gets drastically slower, but isn't more joins just filtering down the number of possible results for the engine, so it should be faster with more tags?

Info

tag_rel contains about half a billion rows with about .5 million unique tag ids. The video table has around 30 million rows.

Full query explain:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id | select_type | table   | partitions | type   | possible_keys            | key     | key_len | ref                           | rows    | filtered | Extra       |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  1 | SIMPLE      | tag_rel | NULL       | ref    | tag_id,video_id          | tag_id  | 4       | const                         | 1847596 |   100.00 | Using index |
|  1 | SIMPLE      | tag_rel | NULL       | eq_ref | tag_id,video_id          | tag_id  | 8       | const,db.tag_rel.video_id     |       1 |   100.00 | Using index |
|  1 | SIMPLE      | tag_rel | NULL       | eq_ref | tag_id,video_id          | tag_id  | 8       | const,db.tag_rel.video_id     |       1 |   100.00 | Using index |
|  1 | SIMPLE      | tag_rel | NULL       | eq_ref | tag_id,video_id          | tag_id  | 8       | const,db.tag_rel.video_id     |       1 |   100.00 | Using index |
|  1 | SIMPLE      | tag_rel | NULL       | eq_ref | tag_id,video_id          | tag_id  | 8       | const,db.tag_rel.video_id     |       1 |   100.00 | Using index |
|  1 | SIMPLE      | v       | NULL       | eq_ref | PRIMARY,is_x_2,deleted   | PRIMARY | 4       | db.tag_rel.video_id           |       1 |    50.00 | Using where |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Indexes:

tag_rel.tag_id      = unique(tag_id, video_id)
tag_rel.video_id    = unique(video_id, tag_id)
video.is_x_2        = (is_x, is_y, deleted)

Best Answer

I don't get why the subqueries are doubly nested. That combined with the x IN <subquery> structure - which is known to be not well optimized in older versions of MySQL - is probably the issue. Even though you use 5.7, where there have been improvements on the optimizer, I'd prefer a JOIN over an IN (SELECT ..), especially if the SELECT is quite complex.

Suggestion: try to rewrite without any subqueries / derived tables.

SELECT v.*
FROM video AS v
   JOIN tag_rel AS tr1 
     ON tr1.tag_id = 1  AND tr1.video_id = v.id
   JOIN tag_rel AS tr2 
     ON tr2.tag_id = 9  AND tr2.video_id = v.id
   JOIN tag_rel AS tr3 
     ON tr3.tag_id = 27 AND tr3.video_id = v.id
   JOIN tag_rel AS tr4 
     ON tr4.tag_id = 13 AND tr4.video_id = v.id
   JOIN tag_rel AS tr5 
     ON tr5.tag_id = 67 AND tr5.video_id = v.id
    -- more joins if more tags -- 
WHERE
    v.is_x = 0
AND v.is_y = 0
AND v.deleted IS NULL
LIMIT 60 ;

Add indexes (no need, it seems you already have these):

  • a (unique) index on tag_rel (tag_id, video_id)
  • an index on video (deleted, is_x, is_y, id)