MySQL – How to improve this slow count query on MySQL

mysql, performance, performance-tuning, query-performance

I'm at a loss here. We have a query that takes around 50s to finish, which I think is too slow. We use MySQL 5.7.17 on Amazon RDS, running on a db.r3.large instance. Here's the query.

SELECT bucket_label            AS bid, 
       Count(user_id)          AS c, 
       Count(DISTINCT user_id) AS cu 
FROM   event_impression 
WHERE  context = 'PROD' 
       AND experiment_id = Unhex(Replace("18454a99-ada6-41a8-b192-bcd3d5c514cb", 
                                 "-", 
                                 "")) 
       AND timestamp >= '2018-04-08 22:21:04' 
       AND timestamp <= '2018-04-10 22:21:04' 
GROUP  BY bucket_label; 

Here are the indexes

mysql> show index from event_impression;
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table            | Non_unique | Key_name                | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| event_impression |          1 | user_id                 |            1 | user_id       | A         |     3248866 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | experiment_id           |            1 | experiment_id | A         |        4305 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | bucket_label            |            1 | bucket_label  | A         |        7108 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | timestamp               |            1 | timestamp     | A         |     3315621 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | event_impression_ibfk_1 |            1 | experiment_id | A         |        2914 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | event_impression_ibfk_1 |            2 | bucket_label  | A         |        9619 |     NULL | NULL   |      | BTREE      |         |               |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.01 sec)

Here's the table schema

mysql> describe event_impression;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| user_id       | varchar(200)  | NO   | MUL | NULL    |       |
| experiment_id | varbinary(16) | NO   | MUL | NULL    |       |
| bucket_label  | varchar(64)   | NO   | MUL | NULL    |       |
| timestamp     | datetime      | NO   | MUL | NULL    |       |
| payload       | varchar(4096) | YES  |     | NULL    |       |
| context       | varchar(200)  | YES  |     | PROD    |       |
+---------------+---------------+------+-----+---------+-------+
6 rows in set (0.01 sec)

And here's the result of that query

mysql> select bucket_label as bid,
    -> count(user_id) as c,
    -> count(distinct user_id) as cu
    -> from wasabi.event_impression
    -> where context = 'PROD'
    -> and experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
    -> and timestamp >= '2018-04-08 22:21:04'
    -> and timestamp <= '2018-04-10 22:21:04'
    -> group by bucket_label;
+---------+--------+-------+
| bid     | c      | cu    |
+---------+--------+-------+
| 1       | 294308 | 22403 |
| 1_1     | 185561 | 14703 |
| 2_1     | 267417 | 22183 |
| 2_2     | 284134 | 21945 |
+---------+--------+-------+
4 rows in set (41.22 sec)

I'm open to any suggestions to improve this query. If you want to see any settings, please let me know.

This is how big the table is

mysql> select count(*) from event_impression;
+----------+
| count(*) |
+----------+
| 40955148 |
+----------+
1 row in set (10.88 sec)

This is the EXPLAIN

+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table            | partitions | type | possible_keys                                                | key                     | key_len | ref   | rows     | filtered | Extra                              |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
|  1 | SIMPLE      | event_impression | NULL       | ref  | experiment_id,bucket_label,timestamp,event_impression_ibfk_1 | event_impression_ibfk_1 | 18      | const | 14958978 |     1.43 | Using index condition; Using where |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
1 row in set, 1 warning (0.16 sec)

EDIT

explain select bucket_label as bid, 
count(user_id) as c, 
count(distinct user_id) as cu
from wasabi.event_impression
where context = 'PROD' 
and experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
and timestamp >= '2018-04-08 22:21:04'
and timestamp <= '2018-04-10 22:21:04'
group by bucket_label;

'1', 'SIMPLE', 'event_impression', NULL, 'ref', 'experiment_id,bucket_label,timestamp,event_impression_ibfk_1', 'event_impression_ibfk_1', '18', 'const', '14551230', '1.32', 'Using index condition; Using where'

EDIT

Switched the WHERE clause to put timestamp at the beginning.

explain select bucket_label as bid, 
count(user_id) as c, 
count(distinct user_id) as cu
from wasabi.event_impression
where timestamp BETWEEN '2018-04-08 22:21:04' AND '2018-04-10 22:21:04'
AND context = 'PROD'
AND experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
group by bucket_label;

+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table            | partitions | type | possible_keys                                                | key                     | key_len | ref   | rows     | filtered | Extra                              |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
|  1 | SIMPLE      | event_impression | NULL       | ref  | experiment_id,bucket_label,timestamp,event_impression_ibfk_1 | event_impression_ibfk_1 | 18      | const | 16274608 |     1.22 | Using index condition; Using where |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+


mysql> show index from wasabi.event_impression;
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table            | Non_unique | Key_name                | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| event_impression |          1 | user_id                 |            1 | user_id       | A         |     3836266 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | experiment_id           |            1 | experiment_id | A         |        5083 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | bucket_label            |            1 | bucket_label  | A         |        8393 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | timestamp               |            1 | timestamp     | A         |     3915091 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | event_impression_ibfk_1 |            1 | experiment_id | A         |        3441 |     NULL | NULL   |      | BTREE      |         |               |
| event_impression |          1 | event_impression_ibfk_1 |            2 | bucket_label  | A         |       11358 |     NULL | NULL   |      | BTREE      |         |               |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.01 sec)

Best Answer

Start an index with the = parts of the WHERE, then tack on one "range" column:

INDEX(context, experiment_id,   -- in either order
      timestamp)
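As a concrete statement, that could look like the following (the index name ix_ctx_exp_ts is my own invention; use whatever fits your naming convention):

```sql
-- Equality columns first, then the one range column.
ALTER TABLE event_impression
    ADD INDEX ix_ctx_exp_ts (context, experiment_id, timestamp);
```

InnoDB can usually add a secondary index in place, but expect it to take a while on a ~41-million-row table.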

That will make the query run faster.

More on creating optimal indexes.

Note: a 'composite' index is not the same as having multiple single-column indexes.

A "covering" index would be bulky, but make it run a somewhat faster:

INDEX(context, experiment_id,   -- in either order
      timestamp,
      bucket_label, user_id)    -- in either order
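Spelled out (again with a made-up index name):

```sql
-- Every column the query touches is in the index, so the query can be
-- answered from the index alone (EXPLAIN should then show "Using index").
ALTER TABLE event_impression
    ADD INDEX ix_covering (context, experiment_id, timestamp,
                           bucket_label, user_id);
```

With user_id being VARCHAR(200) and bucket_label VARCHAR(64), each index entry is large; that is why this index is "bulky".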

Do not put timestamp first.

Index usage works from left to right. Scanning for rows will do what it can with the first column in the index, then move on to the next.

If the first column is tested with = (eg, context = 'PROD'), then all the index rows that match are adjacent, and the next column can also be used.

If the first column is tested with a 'range' (eg, timestamp BETWEEN ... AND ...), the next column becomes useless. So the Optimizer stops at the first range.
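To make the contrast concrete for the query in question, compare these two column orders (sketches, not tied to the existing index names):

```sql
-- Range first: the seek can only use `timestamp`; `context` and
-- `experiment_id` must be checked row by row across the whole range.
INDEX(timestamp, context, experiment_id)     -- poor for this query

-- Equalities first: the seek lands on one contiguous run of entries for
-- (context, experiment_id), and the timestamp range scans only that run.
INDEX(context, experiment_id, timestamp)     -- good for this query
```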

A familiar example... Suppose there is a list of people, sorted by last name then first name, and we want to find 'Dave Poole'.

INDEX(last, first)

I suspect there is little quibble with

WHERE first = 'Dave'
  AND last  = 'Poole'   -- (in either order)

But now, what if the query is

WHERE first LIKE 'D%'   -- (this is one form of "range")
  AND last = 'Poole'

This turns out to be efficient -- drill down the BTree, within the long list of 'Pooles', to where the D's start, then scan forward.

On the other hand, what about

WHERE first = 'Dave'
  AND last LIKE 'P%'

Now how do you find the entries? You can quickly drill down the BTree to where P... starts, but then you must scan all of the entries whose last name starts with P, checking each for first = 'Dave'. There is no more efficient way than that.

(Of course, having INDEX(first, last) would be efficient.)

As for selectiveness / cardinality...

  • Searching for Poole+Dave is very selective -- it is the combination of both. The BTree does not find Poole, then Dave; it looks up both at the same time.
  • Searching for Poole+D% also gets the combined selectiveness of the pair, which is better than that of either last or first alone.
  • Searching for P%+... can only use the selectiveness of the "P" in last, hence the worst case.

So, starting with 2018... (timestamp) is worse than PROD+abcd+2018 (the index I recommended).