MySQL – Subquery with IN Operator Optimization

MySQLperformancequery-performancesubquery

I have a MySQL query in which I try to get articles with keywords matching those of a given page and that are not already posted to it.

The db tables are :

items : Articles table.
features : Keywords table.
item_features : Pivot table containing article keywords.
page_items : Page articles table : articles already posted to a page.

My query is as follows:

SELECT suggested_items.*
FROM (
       -- select items that share at least one keyword with the page
       -- and count the matching keywords ...
       SELECT
         items.*,
         COUNT(items.id) AS matching_count
       FROM items
         INNER JOIN item_features ON item_features.item_id = items.id
       WHERE
         feature_id IN (21, 654, 2522, 2226, 645, 9854, 65, 4897)
       GROUP BY items.id
     ) suggested_items

  -- Remove all articles that were already suggested to the page...
  LEFT JOIN page_items ON suggested_items.id = page_items.item_id AND page_items.page_id = 654
WHERE
   page_items.id IS NULL

-- order by the number of matching keywords ....
ORDER BY suggested_items.matching_count DESC;

Note : The feature_id in the IN operator of subquery can receive up to
70 keyword id.

Query Explain :

   +----+-------------+---------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------------------------------+------+----------------------------------------------+   
| id | select_type | table         | type   | possible_keys                                                                                                                             | key                                     | key_len | ref                          | rows | Extra                                        |   
+----+-------------+---------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------------------------------+------+----------------------------------------------+   
|  1 | PRIMARY     | <derived2>    | ALL    | NULL                                                                                                                                      | NULL                                    | NULL    | NULL                         |  216 | Using filesort                               |   
|  1 | PRIMARY     | page_items    | eq_ref | page_items_page_id_item_id_unique_index,page_items_page_id_fk,page_items_item_id_fk,page_items_page_id_item_id_date_queued_for_post_index | page_items_page_id_item_id_unique_index | 8       | const,suggested_items.id     |    1 | Using where; Using index; Not exists         |   
|  2 | DERIVED     | item_features | range  | item_and_featureId_unicity,item_features__item_id_fk,item_features__feature_id_fk                                                         | item_features__feature_id_fk            | 4       | NULL                         |  222 | Using where; Using temporary; Using filesort |   
|  2 | DERIVED     | items         | eq_ref | PRIMARY                                                                                                                                   | PRIMARY                                 | 4       | expulp.item_features.item_id |    1 |                                              |   
+----+-------------+---------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------------------------------+------+----------------------------------------------+  

The query is always present in the top of mysql slow-queries log and it takes about 60s.

I'm looking for a way to optimize this query, maybe in rewriting it to 2 queries instead or something similar, without the need of using the subquery.

UPDATE

SHOW CREATE TABLE for item_features :

CREATE TABLE `item_features` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `item_id` int(11) NOT NULL,
  `feature_id` int(11) NOT NULL,
  `occurrence` int(11) NOT NULL DEFAULT '1',
  `created_date` datetime DEFAULT NULL,
  `modified_date` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `item_and_featureId_unicity` (`item_id`,`feature_id`),
  KEY `item_features__feature_id_fk` (`feature_id`)
) ENGINE=InnoDB AUTO_INCREMENT=16401308 DEFAULT CHARSET=utf8

SHOW CREATE TABLE for items.

CREATE TABLE `items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(2084) DEFAULT NULL,
  `language` varchar(8) DEFAULT 'en',
  `shortid` varchar(10) DEFAULT NULL,
  `title` varchar(256) NOT NULL,
  `stream_id` int(11) DEFAULT NULL,
  `source_str` varchar(255) DEFAULT NULL,
  `score` int(10) unsigned DEFAULT '0',
  `categories` longtext,
  `tags` longtext,
  `summary` longtext,
  `content` longtext NOT NULL,
  `curator_comment` longtext,
  `slug` varchar(100) DEFAULT NULL,
  `image` longtext,
  `media` int(11) DEFAULT NULL,
  `raw_media` longtext,
  `local_image` longtext,
  `format` char(10) DEFAULT NULL,
  `type` char(8) DEFAULT NULL,
  `date_retrieved` datetime DEFAULT NULL,
  `date_authored` datetime NOT NULL,
  `date_published` datetime DEFAULT NULL,
  `data` longtext,
  `virality` double NOT NULL DEFAULT '0',
  `virality_json` longtext,
  `virality_engagements` text,
  `virality_blocked` tinyint(4) NOT NULL DEFAULT '0',
  `virality_schedual` int(11) DEFAULT NULL,
  `author` varchar(255) DEFAULT NULL,
  `article_html` text,
  PRIMARY KEY (`id`),
  KEY `items_date_retrieved_index` (`date_retrieved`),
  KEY `items_url_unique_index` (`url`(255)),
  KEY `items_media_index` (`media`),
  KEY `items_virality_blocked_url_index` (`virality_blocked`,`url`(255)),
  KEY `items_stream_id_fk_date_retrieved_index` (`stream_id`,`date_retrieved`)
) ENGINE=InnoDB AUTO_INCREMENT=1106025 DEFAULT CHARSET=utf8

SHOW CREATE TABLE for page_items;

CREATE TABLE `page_items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `item_id` int(11) NOT NULL,
  `page_id` int(11) NOT NULL,
  `shortid` varchar(10) DEFAULT NULL,
  `columns` int(11) DEFAULT '1',
  `categories` longtext,
  `curator_summary` longtext,
  `type` char(7) DEFAULT NULL,
  `date_scheduled` datetime DEFAULT NULL,
  `user_cancelled` tinyint(4) DEFAULT '0',
  `direct_url` varchar(2084) DEFAULT NULL,
  `date_queued_for_post` datetime DEFAULT NULL,
  `approval_pending` tinyint(4) DEFAULT '0',
  `title` varchar(256) NOT NULL,
  `views` int(10) unsigned DEFAULT '0',
  `tags` longtext,
  `content` longtext NOT NULL,
  `curator_comment` longtext,
  `image` longtext,
  `media` int(11) DEFAULT NULL,
  `slug` varchar(100) DEFAULT NULL,
  `format` enum('extracted','redirect','container') DEFAULT 'container',
  `allow_comments` tinyint(4) DEFAULT '0',
  `is_featured` tinyint(4) DEFAULT '0',
  `likes` int(10) unsigned DEFAULT '0',
  `dislikes` int(10) unsigned DEFAULT '0',
  `date_published` datetime DEFAULT NULL,
  `deleted` tinyint(4) NOT NULL DEFAULT '0',
  `date_deleted` datetime DEFAULT NULL,
  `language` varchar(8) DEFAULT 'en',
  `collected_from` int(11) DEFAULT NULL,
  `seen` tinyint(4) NOT NULL DEFAULT '0',
  `data` longtext,
  `suggested` tinyint(4) DEFAULT '0',
  `score` float DEFAULT '0' COMMENT 'Recency_factor x Topics_match_ratio * Feature_Match_Count * Url_Virality_Score.',
  `status` enum('draft','published','queued','approved','scheduled','suggested','rejected') DEFAULT 'draft',
  PRIMARY KEY (`id`),
  UNIQUE KEY `page_items_page_id_item_id_unique_index` (`page_id`,`item_id`),
  KEY `page_items_page_id_fk` (`page_id`),
  KEY `page_items_type_index` (`type`),
  KEY `page_items_item_id_fk` (`item_id`),
  KEY `page_items_slug_index` (`slug`),
  KEY `page_items_shortid_index` (`shortid`),
  KEY `page_items_format_index` (`format`),
  KEY `page_items_date_published_index` (`date_published`),
  KEY `page_items_date_scheduled_index` (`date_scheduled`),
  KEY `page_items_date_queued_for_post_index` (`date_queued_for_post`),
  KEY `page_items_collected_from_index` (`collected_from`),
  KEY `page_items_suggested_from_index` (`suggested`),
  KEY `page_items_deleted_from_index` (`deleted`),
  KEY `page_items_user_cancelled_from_index` (`user_cancelled`),
  KEY `page_items_is_featured_from_index` (`is_featured`),
  KEY `page_items_media_id_index` (`media`),
  KEY `page_items_status_index` (`status`),
  KEY `page_items_page_id_item_id_date_queued_for_post_index` (`page_id`,`item_id`,`date_queued_for_post`),
  KEY `page_items_date_queued_for_post_status_type_index` (`date_queued_for_post`,`status`,`type`),
  KEY `page_items_date_queued_for_post_type_index` (`date_queued_for_post`,`type`)
) ENGINE=InnoDB AUTO_INCREMENT=3249661 DEFAULT CHARSET=utf8

Best Answer

One speedup is to get rid of id in the many:many table items_features. Then promote the UNIQUE key to PRIMARY. 7 tips on optimizing many:many table.

Do you need all of items.*? In particular, if you don't need the text fields, things will run faster by not having to retrieve and lug them through the tmp table.

Prefix indexes (KEY items_url_unique_index (url(255))) are rarely useful. This one might work; let's see the query that wants it. And the name includes "unique"; what gives?

type char(7) DEFAULT NULL takes 21 bytes because of utf8; maybe use VARCHAR or CHARACTER SET ascii for the column or turn into an ENUM, or ...

If you are going to have a date range, put the date column last in indexes like (date_queued_for_post,type).

Indexes on flags are almost never used. You may as well DROP them.

Another issue... You have a JOIN with a GROUP BY. This often signals inefficiency and/or wrong output. I think your inner query can be rewritten thus:

SELECT  items.*, x.matching_count
    FROM  
    (
        SELECT  item_id, COUNT(*) AS matching_count
            FROM  item_features
            WHERE  feature_id IN (...)
            GROUP BY  item_id 
    ) AS x
    JOIN  items USING(item_id)

The inner part of this should be quite efficient (especially after you renovate the many:many table). And it won't be fetching the bulky stuff from items until after doing the GROUP BY and COUNT.

Debug that before preceding.

After that, I am unclear whether you need the extra layer. Try this:

SELECT  items.*, x.matching_count
    FROM  
    (
        SELECT  item_id, COUNT(*) AS matching_count
            FROM  item_features
            WHERE  feature_id IN (...)
            GROUP BY  item_id 
    ) AS x
    JOIN  items USING(item_id)
    LEFT JOIN  page_items pi  ON items.id = pi.item_id
      AND  pi.page_id = 654
    WHERE  pi.id IS NULL
    ORDER BY  x.matching_count DESC;

(I am unclear whether AND pi.page_id = 654 should be in the ON or in the WHERE.)