Mysql – Aggregate query optimization

index-tuningMySQLmysql-5.5

Could you please help me to optimize this query – currently it takes ~4 minutes to run this query. It looks like keys: ip and timestamp are non-unique so shall I add some indexes?

mysql> EXPLAIN
    SELECT max(url) as url, 
           max(title) as title, 
           keyword as keyword,
           COALESCE(max(domain), 'example.com') as domain 
    FROM `test_url` as tu 
      JOIN `test_log` AS tl 
        ON  tl.shorturl = tu.keyword 
        AND tl.click_id > 
            (SELECT option_value 
             from `test_options` 
             WHERE option_name = 'click_id' ) 
    WHERE 1 = 1 
    GROUP BY keyword 
    HAVING count( keyword ) > 50 
    ORDER BY count( keyword ) DESC 
    LIMIT 10;
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
| id | select_type | table        | type   | possible_keys    | key         | key_len | ref                   | rows    | Extra                                        |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
|  1 | PRIMARY     | tl           | range  | PRIMARY,shorturl | PRIMARY     | 4       | NULL                  | 5748586 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | tu           | eq_ref | PRIMARY          | PRIMARY     | 602     | test_urls.tl.shorturl |       1 |                                              |
|  2 | SUBQUERY    | test_options | ref    | option_name      | option_name | 194     |                       |       1 | Using where                                  |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
3 rows in set (0.00 sec)

mysql> SHOW CREATE TABLE test_url\G
*************************** 1. row ***************************
       Table: test_url
Create Table: CREATE TABLE `test_url` (
  `keyword` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `title` text,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `ip` varchar(41) NOT NULL,
  `clicks` int(10) unsigned NOT NULL,
  `username` varchar(255) NOT NULL DEFAULT '',
  `company_id` int(11) NOT NULL DEFAULT '0',
  `domain` varchar(31) NOT NULL DEFAULT '',
  PRIMARY KEY (`keyword`),
  KEY `timestamp` (`timestamp`),
  KEY `ip` (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW INDEXES FROM test_url;
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_url |          0 | PRIMARY   |            1 | keyword     | A         |      176798 |     NULL | NULL   |      | BTREE      |         |               |
| test_url |          1 | timestamp |            1 | timestamp   | A         |      176798 |     NULL | NULL   |      | BTREE      |         |               |
| test_url |          1 | ip        |            1 | ip          | A         |          11 |     NULL | NULL   |      | BTREE      |         |               |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.08 sec)

Edit: Tue Jun 17 15:17:36 BST 2014

mysql> SHOW CREATE TABLE test_log\G
*************************** 1. row ***************************
       Table: test_log
Create Table: CREATE TABLE `test_log` (
  `click_id` int(11) NOT NULL AUTO_INCREMENT,
  `click_time` datetime NOT NULL,
  `shorturl` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `referrer` varchar(200) NOT NULL,
  `user_agent` varchar(255) NOT NULL,
  `ip_address` varchar(41) NOT NULL,
  `country_code` char(2) NOT NULL,
  PRIMARY KEY (`click_id`),
  KEY `shorturl` (`shorturl`)
) ENGINE=InnoDB AUTO_INCREMENT=51636348 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW CREATE TABLE test_options\G
*************************** 1. row ***************************
       Table: test_options
Create Table: CREATE TABLE `test_options` (
  `option_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `option_name` varchar(64) NOT NULL DEFAULT '',
  `option_value` longtext NOT NULL,
  PRIMARY KEY (`option_id`,`option_name`),
  KEY `option_name` (`option_name`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

Edit: Fri Jun 20 10:40:35 BST 2014

Database size is 4975MB (3965MB data size + 1010MB index size)

MySQL veriosn: 5.5.35
OS: Ubuntu 12.04

This is basically this application: http://yourls.org/

The main problem is that this query is causing a high load on the server using temporary tables created on disk (State: Copying to tmp table)

Best Answer

I bear bad news. I've studied your tables and your query and looked at many variations on them, ran them against a synthetic data set approximately the same size as yours. And I'm sorry to report that I'm 98% sure that your query is nearly optimal and that you already have the right indexes in place. I welcome anyone who can disprove this and come up with additional indexes + rewritten sql that can significantly improve performance.

Aggregation != Searching. You're doing OLAP. Your query seems to be doing aggregation over about 5 million rows. Normally if I see a query doing a full table scan or a wide index range scan, I would suspect that indexes might not be optimal unless (big unless here) you're doing an aggregation (such as COUNT, SUM, AVG, etc), then there's no way around accessing all those rows since you need to compute on them.

I don't believe you are going to get significantly quicker results simply by adding indexes and rewriting the SQL alone. You'll need to incorporate some other approaches:

  • Pre-aggregating data. You could have some background job that runs and pre-aggregates your data by time intervals. This could be by 1-day intervals, 1-hour, 5-minutes, 1-minute ... whatever makes sense for your application in terms of how much latency you can tolerate.

  • Partitioning your data could help. I was thinking you might get some improvement by hash partitioning test_log over HASH(shorturl) in 32 hash partitions. But you'd need to redefine your PK and then maintaining the lifecycle of the data could become more difficult (not sure whether you're purging data already, but you probably should, or at least archive it).

Bigger picture, it looks like you're using a RDBMS (MySQL database) to do real-time keyword/click analysis. However, I don't believe an RDBMS is ideally suited for this task. Big data and stream processing are really becoming hot on this exact problem. The following search might get you in the right direction: http://www.google.com/search?q=stream+computing+real-time+click+analysis