Mysql – How to restructure this slow query containing subquery

MySQLoptimizationquery

This query takes about 2-6 seconds to run depending which tag id(s) is/are selected:

SELECT t2.id, t2.name, count(*) as kpl FROM
    (SELECT pt1.post_id
    FROM posts_tags pt1
    WHERE pt1.tag_id IN (88, 5)
    GROUP BY pt1.post_id
    HAVING COUNT(DISTINCT pt1.tag_id) = 2) Matchingposts
INNER JOIN posts_tags pt2 ON (Matchingposts.post_id = pt2.post_id)
INNER JOIN tags t2 ON (pt2.tag_id = t2.id)
WHERE pt2.tag_id NOT IN (88, 5)
GROUP BY pt2.tag_id
HAVING count(*) > 10;

This query gives list of tags available to filter posts further. There can be 1 to n tags selected but the problem is biggest when there are only 1 selected.

Here is EXPLAIN:

+----+-------------+------------+--------+-------------------------------------+------------+---------+--------------------------------+-------+---------------------------------+
| id | select_type | table      | type   | possible_keys                       | key        | key_len | ref                            | rows  | Extra                           |
+----+-------------+------------+--------+-------------------------------------+------------+---------+--------------------------------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                                | NULL       | NULL    | NULL                           | 13583 | Using temporary; Using filesort |
|  1 | PRIMARY     | pt2        | ref    | post_id,tag_id,post_id_2,tag_id_2   | post_id_2  | 5       | Matchingposts.post_id          |     9 | Using where; Using index        |
|  1 | PRIMARY     | t2         | eq_ref | PRIMARY                             | PRIMARY    | 4       | database.pt2.tag_id            |     1 |                                 |
|  2 | DERIVED     | pt1        | ref    | tag_id,tag_id_2                     | tag_id_2   | 5       |                                | 11586 | Using where; Using index        |
+----+-------------+------------+--------+-------------------------------------+------------+---------+--------------------------------+-------+---------------------------------+
4 rows in set (0.00 sec)

The question is: is there better way to structure this query? And if there isn't: How to config MySQL to keep the tmp table in memory?

In future I would like to use ORDER BY t2.name but I have left that out for now.

EDIT: Tried so far:

My original query: 0,9299 s resultset 379 rows

ypercupe's query: 1,1457 s resultset 379 rows

Walter Mitty's: 1,3014 s resultset 356 rows

EXPLAIN for both Walter Mitty's and ypercupe's query were exactly same:

+----+-------------+-------+--------+-------------------------------------+------------+---------+------------------------------------------+-------+-----------------------------------------------------------+
| id | select_type | table | type   | possible_keys                       | key        | key_len | ref                                      | rows  | Extra                                                     |
+----+-------------+-------+--------+-------------------------------------+------------+---------+------------------------------------------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | pt_88 | ref    | post_id,tag_id,post_id_2,tag_id_2   | tag_id_2   | 5       | const                                    | 19071 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | pt_5  | ref    | post_id,tag_id,post_id_2,tag_id_2   | post_id_2  | 10      | database.pt_88.post_id,const             |     1 | Using where; Using index                                  |
|  1 | SIMPLE      | pt2   | ref    | post_id,tag_id,post_id_2,tag_id_2   | post_id_2  | 5       | database.pt_5.post_id                    |     9 | Using where; Using index                                  |
|  1 | SIMPLE      | t2    | eq_ref | PRIMARY                             | PRIMARY    | 4       | database.pt2.tag_id                      |     1 |                                                           |
+----+-------------+-------+--------+-------------------------------------+------------+---------+------------------------------------------+-------+-----------------------------------------------------------+
4 rows in set (0.00 sec)

You asked about indexes in posts_tags table. I have put all possible indexes there so MySQL can pick most suitable.

And yes. post_id tag_id combinations should be unique.

EDIT 2: SHOW CREATE TABLEs

 CREATE TABLE `posts_tags` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `post_id` int(11) DEFAULT NULL,
  `tag_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `post_id` (`post_id`),
  KEY `tag_id` (`tag_id`),
  KEY `post_id_2` (`post_id`,`tag_id`),
  KEY `tag_id_2` (`tag_id`,`post_id`)
) ENGINE=MyISAM AUTO_INCREMENT=4556133 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

CREATE TABLE `tags` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`)
) ENGINE=MyISAM AUTO_INCREMENT=44433 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

EDIT 3: Rolando's query:

I killed it after 60 s. State was "copying to tmp table".

Here is EXPLAIN:

+----+-------------+-------------+--------+-----------------+------------+---------+------------+---------+---------------------------------+
| id | select_type | table       | type   | possible_keys   | key        | key_len | ref        | rows    | Extra                           |
+----+-------------+-------------+--------+-----------------+------------+---------+------------+---------+---------------------------------+
|  1 | PRIMARY     | <derived2>  | ALL    | NULL            | NULL       | NULL    | NULL       |    6443 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived3>  | ALL    | NULL            | NULL       | NULL    | NULL       | 4466982 | Using where; Using join buffer  |
|  1 | PRIMARY     | t2          | eq_ref | PRIMARY         | PRIMARY    | 4       | pt2.tag_id |       1 |                                 |
|  3 | DERIVED     | posts_tags  | index  | tag_id,tag_id_2 | post_id_2  | 10      | NULL       | 4561828 | Using where; Using index        |
|  2 | DERIVED     | pt1         | index  | tag_id,tag_id_2 | post_id_2  | 10      | NULL       | 4561828 | Using index                     |
+----+-------------+-------------+--------+-----------------+------------+---------+------------+---------+---------------------------------+
5 rows in set (5.15 sec)

EDIT 4: Outputs as Rolando requested:

mysql> SELECT COUNT(DISTINCT post_id) postcount FROM posts_tags;
+-----------+
| postcount |
+-----------+
|    515935 |
+-----------+
1 row in set (10.57 sec)

mysql> SELECT COUNT(DISTINCT tag_id) tagcount FROM posts_tags;
+----------+
| tagcount |
+----------+
|    45078 |
+----------+
1 row in set (8.25 sec)

mysql> SELECT tag_id,COUNT(1) postcount FROM  posts_tags GROUP BY tag_id INTO OUTFILE '/tmp/output.txt';
Query OK, 45083 rows affected (1.84 sec)
http://pastebin.com/1VRW0UEr

mysql> SELECT post_id,COUNT(1) tagcount FROM posts_tags GROUP BY post_id INTO OUTFILE '/tmp/output2.txt';
Query OK, 516247 rows affected (2.29 sec)
http://koti.mbnet.fi/photos/output2.txt

Updated question:

The problem is biggest when only one of common used tag is selected. In such case we need to check what different tags 50000 posts have.

Can we agree that tmp table is inevitable in any case?

Best Answer

Is it safe to assume that for every relevant post the tag 88 occurs exactly once and the tag 5 also occurs exactly once?

Is there an index on post_id in table post_tags?

If the answer is yes to both, something like this might work:

select
      t1.id,
      t1.name,
      count (*) - 2 as kpl
from
              post_tags pt1
   inner join post_tags pt2 on (pt2.post_id = pt1.post_id)
   inner join post_tags pt3 on (pt3.post_id = pt1.post_id)
   inner join tags t1 on (t1.id = pt3.tag_id)
where
         pt1.tag_id = 88
    and  pt2.tag_id = 5
group by
   t1.id,
   t1.name
having
   count (*) > 12;