Mysql – Optimizing intersect query in MySQL

MySQLperformanceperformance-tuningquery-performance

I'm facing a performances problem in our MariaDB server.

We have a table linking users to pages, a user belongs to 1+ pages.

I have a page that have users and I want to get the 10 best pages based on users they share with the current page (page_id).

Here is the schema :

CREATE TABLE IF NOT EXISTS `pages_users3` (
  `user_id` bigint(20) unsigned NOT NULL,
  `page_id` bigint(20) unsigned NOT NULL,
  `comments` int(11) NOT NULL DEFAULT '0',
  `likes` int(11) NOT NULL DEFAULT '0',
  `shares` int(11) NOT NULL DEFAULT '0',
  `posts` int(11) NOT NULL DEFAULT '0',
  `interactions` int(11) NOT NULL DEFAULT '0',
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`page_id`,`user_id`) `clustering`=yes,
  KEY `user_page` (`user_id`)
) ENGINE=TokuDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci `compression`='tokudb_zlib' AUTO_INCREMENT=30127543590 ;

I have a query that can retrieve similar pages of a page based on users intersections :

SELECT `pages_users`.`page_id` , COUNT(*) AS count
FROM `pages_users`
LEFT JOIN `pages_users` AS `pu2` ON `pu2`.`user_id` = `pages_users`.`user_id`
WHERE `pages_users`.`page_id` != *PAGE_ID*
AND `pu2`.`page_id` = *PAGE_ID* /* To avoid getting the current page in results */
GROUP BY `pages_users`.`page_id`
ORDER BY `count` DESC
LIMIT 10

Unfortunaly this query runs in about 1m30s, so I would like to optimize it…

Here is the result of an explain :

+------+-------------+-------------+--------+-----------------------------------------------------------------------------------+------------------------------------------------+---------+----------------------------------------+-----------+-----------------------------------------------------------+
| id   | select_type | table       | type   | possible_keys                                                                     | key                                            | key_len | ref                                    | rows      | Extra                                                     |
+------+-------------+-------------+--------+-----------------------------------------------------------------------------------+------------------------------------------------+---------+----------------------------------------+-----------+-----------------------------------------------------------+
|    1 | SIMPLE      | pages_users | index  | pages_users_user_id_page_id_unique,pages_users_user_id_page_id_interactions_index | pages_users_user_id_page_id_interactions_index | 20      | NULL                                   | 165354919 | Using where; Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | pu2         | eq_ref | pages_users_user_id_page_id_unique,pages_users_user_id_page_id_interactions_index | pages_users_user_id_page_id_unique             | 16      | soprism_prod.pages_users.user_id,const |         1 | Using index                                               |
+------+-------------+-------------+--------+-----------------------------------------------------------------------------------+------------------------------------------------+---------+----------------------------------------+-----------+-----------------------------------------------------------+
2 rows in set (0.00 sec)

I'm requesting your help because I spent many hours trying to optimize that query and I feel confused 🙁
I'm open to anything, from refactoring to changing server parameters…
Thank you !

Some additional insights :

SELECT COUNT(user_id) FROM `pages_users` WHERE `page_id` = *PAGE_ID*;
+----------------+
| COUNT(user_id) |
+----------------+
|         146132 |
+----------------+
1 row in set (0.08 sec)

SELECT COUNT(user_id) FROM `pages_users`;
+----------------+
| COUNT(user_id) |
+----------------+
|      165354918 |
+----------------+
1 row in set (1 min 0.98 sec)

Hardware : Quad Core Xeon, 32Gb RAM, Raid 1 2x2Tb

Best Answer

Please try: (edited 2014-12-18: replaced Left with INNER join):

SELECT 
    `pages_users`.`page_id` , 
    COUNT(`pu2`.`page_id`) AS `count`
FROM 
    `pages_users`
    INNER JOIN `pages_users` AS `pu2` 
        ON `pu2`.`user_id` = `pages_users`.`user_id`
           AND `pu2`.`page_id` = *PAGE_ID* 
WHERE 
    `pages_users`.`page_id` != *PAGE_ID*
GROUP BY 
    `pages_users`.`page_id`
ORDER BY 
    `count` DESC
LIMIT 10

or as an alternative

SELECT 
    `pages_users`.`page_id` , 
    COUNT(`pu2`.`pu2_page_id`) AS `count`
FROM 
    `pages_users`
    INNER JOIN 
    (
        SELECT  
            `pu2_intern`.`page_id` AS `pu2_page_id`
        FROM
            `pages_users` AS `pu2_intern`
        WHERE
            `pu2_intern`.`page_id` = *PAGE_ID*
    )  AS `pu2`
        ON `pu2`.`user_id` = `pages_users`.`user_id`
WHERE 
    `pages_users`.`page_id` != *PAGE_ID*
GROUP BY 
    `pages_users`.`page_id`
ORDER BY 
    `count` DESC
LIMIT 10

You could also add another index just on the field page_id:

KEY `page` (`page_id`)

I hope it could help you.