I'm at a loss here. We have a query that takes around 50s to finish, which I think is too slow. Here's the query. We use MySQL on Amazon RDS, running on a db.r3.large instance with MySQL 5.7.17.
SELECT bucket_label AS bid,
Count(user_id) AS c,
Count(DISTINCT user_id) AS cu
FROM event_impression
WHERE context = 'PROD'
AND experiment_id = Unhex(Replace("18454a99-ada6-41a8-b192-bcd3d5c514cb",
"-",
""))
AND timestamp >= '2018-04-08 22:21:04'
AND timestamp <= '2018-04-10 22:21:04'
GROUP BY bucket_label;
Here are the indexes
mysql> show index from event_impression;
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| event_impression | 1 | user_id | 1 | user_id | A | 3248866 | NULL | NULL | | BTREE | | |
| event_impression | 1 | experiment_id | 1 | experiment_id | A | 4305 | NULL | NULL | | BTREE | | |
| event_impression | 1 | bucket_label | 1 | bucket_label | A | 7108 | NULL | NULL | | BTREE | | |
| event_impression | 1 | timestamp | 1 | timestamp | A | 3315621 | NULL | NULL | | BTREE | | |
| event_impression | 1 | event_impression_ibfk_1 | 1 | experiment_id | A | 2914 | NULL | NULL | | BTREE | | |
| event_impression | 1 | event_impression_ibfk_1 | 2 | bucket_label | A | 9619 | NULL | NULL | | BTREE | | |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.01 sec)
Here's the table schema
mysql> describe event_impression;
+---------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| user_id | varchar(200) | NO | MUL | NULL | |
| experiment_id | varbinary(16) | NO | MUL | NULL | |
| bucket_label | varchar(64) | NO | MUL | NULL | |
| timestamp | datetime | NO | MUL | NULL | |
| payload | varchar(4096) | YES | | NULL | |
| context | varchar(200) | YES | | PROD | |
+---------------+---------------+------+-----+---------+-------+
6 rows in set (0.01 sec)
And here's the result of that query
mysql> select bucket_label as bid,
-> count(user_id) as c,
-> count(distinct user_id) as cu
-> from wasabi.event_impression
-> where context = 'PROD'
-> and experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
-> and timestamp >= '2018-04-08 22:21:04'
-> and timestamp <= '2018-04-10 22:21:04'
-> group by bucket_label;
+---------+--------+-------+
| bid | c | cu |
+---------+--------+-------+
| 1 | 294308 | 22403 |
| 1_1 | 185561 | 14703 |
| 2_1 | 267417 | 22183 |
| 2_2 | 284134 | 21945 |
+---------+--------+-------+
4 rows in set (41.22 sec)
I'm open to any suggestions to improve this query. If you need to see any settings, please let me know.
This is how big the table is
mysql> select count(*) from event_impression;
+----------+
| count(*) |
+----------+
| 40955148 |
+----------+
1 row in set (10.88 sec)
This is the EXPLAIN
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| 1 | SIMPLE | event_impression | NULL | ref | experiment_id,bucket_label,timestamp,event_impression_ibfk_1 | event_impression_ibfk_1 | 18 | const | 14958978 | 1.43 | Using index condition; Using where |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
1 row in set, 1 warning (0.16 sec)
EDIT
explain select bucket_label as bid,
count(user_id) as c,
count(distinct user_id) as cu
from wasabi.event_impression
where context = 'PROD'
and experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
and timestamp >= '2018-04-08 22:21:04'
and timestamp <= '2018-04-10 22:21:04'
group by bucket_label;
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table            | partitions | type | possible_keys                                                | key                     | key_len | ref   | rows     | filtered | Extra                              |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
|  1 | SIMPLE      | event_impression | NULL       | ref  | experiment_id,bucket_label,timestamp,event_impression_ibfk_1 | event_impression_ibfk_1 | 18      | const | 14551230 | 1.32     | Using index condition; Using where |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
EDIT
I switched the WHERE clause to have timestamp at the beginning.
explain select bucket_label as bid,
count(user_id) as c,
count(distinct user_id) as cu
from wasabi.event_impression
where timestamp BETWEEN '2018-04-08 22:21:04' AND '2018-04-10 22:21:04'
AND context = 'PROD'
AND experiment_id = UNHEX(REPLACE("18454a99-ada6-41a8-b192-bcd3d5c514cb", "-",""))
group by bucket_label;
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
| 1 | SIMPLE | event_impression | NULL | ref | experiment_id,bucket_label,timestamp,event_impression_ibfk_1 | event_impression_ibfk_1 | 18 | const | 16274608 | 1.22 | Using index condition; Using where |
+----+-------------+------------------+------------+------+--------------------------------------------------------------+-------------------------+---------+-------+----------+----------+------------------------------------+
mysql> show index from wasabi.event_impression;
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| event_impression | 1 | user_id | 1 | user_id | A | 3836266 | NULL | NULL | | BTREE | | |
| event_impression | 1 | experiment_id | 1 | experiment_id | A | 5083 | NULL | NULL | | BTREE | | |
| event_impression | 1 | bucket_label | 1 | bucket_label | A | 8393 | NULL | NULL | | BTREE | | |
| event_impression | 1 | timestamp | 1 | timestamp | A | 3915091 | NULL | NULL | | BTREE | | |
| event_impression | 1 | event_impression_ibfk_1 | 1 | experiment_id | A | 3441 | NULL | NULL | | BTREE | | |
| event_impression | 1 | event_impression_ibfk_1 | 2 | bucket_label | A | 11358 | NULL | NULL | | BTREE | | |
+------------------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.01 sec)
Best Answer
Start an index with the = parts of the WHERE, then add one "range" column:

INDEX(context, experiment_id,  -- the '=' tests, in either order
      timestamp)               -- then the range

That will make the query run faster.
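For example, such an index could be added with something like this (the index name is just illustrative):

ALTER TABLE event_impression
    ADD INDEX idx_context_experiment_ts (context, experiment_id, timestamp);
    -- note: building this on a ~40M-row table will take a while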
More on creating optimal indexes.
Note: a 'composite' index is not the same as having multiple single-column indexes.
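To sketch the difference (not a suggestion to create all of these):

-- Multiple single-column indexes: the optimizer usually picks only one,
-- so the other predicates must be checked row by row.
INDEX(context)
INDEX(experiment_id)
INDEX(timestamp)

-- One composite index: all three predicates are handled inside a single BTree.
INDEX(context, experiment_id, timestamp)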
A "covering" index would be bulky, but make it run a somewhat faster:
Do not put timestamp first.

Index usage works from left to right. Scanning for rows will do what it can with the first column in the index, then move on to the next.

If the first column is tested with = (eg, context = 'PROD'), then all the rows in the index that match are adjacent, and the next column can be useful. If the first column is tested with a 'range' (eg, timestamp BETWEEN ... AND ...), the next column becomes useless, so the Optimizer stops at the first range.

A familiar example... Suppose there is a list of people, sorted by last name first (in index terms, INDEX(last, first)), and we want to find 'Dave Poole'.
I suspect there is little quibble with

WHERE last = 'Poole' AND first = 'Dave'
But now, what if the query is

WHERE last = 'Poole' AND first LIKE 'D%'  -- a "range" on first
This turns out to be efficient -- Drill down the BTree to the long list of 'Pooles' to where the D's are, then scan forward.
On the other hand, what about

WHERE last LIKE 'P%' AND first = 'Dave'

Now how do you find the entries? You can quickly drill down the BTree to where 'P...' starts, but then you must scan all of the entries whose last name starts with P, checking each one for first = 'Dave'. There is no more efficient way than that.
(Of course, having INDEX(first, last) would make that last query efficient.)

As for selectiveness / cardinality... it does not matter which has the higher cardinality, last or first. The LIKE 'P%' is a range test on the first column of the index, last, hence the worst case. So, an index starting with 2018... (timestamp) is worse than one starting with PROD+abcd+2018 (the index I recommended).
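As a quick sanity check (a sketch, assuming the composite index suggested above has been created), re-running EXPLAIN should now show the new index in the key column and a far smaller rows estimate:

EXPLAIN SELECT bucket_label AS bid,
       COUNT(user_id) AS c,
       COUNT(DISTINCT user_id) AS cu
FROM event_impression
WHERE context = 'PROD'
  AND experiment_id = UNHEX(REPLACE('18454a99-ada6-41a8-b192-bcd3d5c514cb', '-', ''))
  AND timestamp BETWEEN '2018-04-08 22:21:04' AND '2018-04-10 22:21:04'
GROUP BY bucket_label;
-- Expect: key names the new composite index, and rows drops well below the
-- ~15M estimated with event_impression_ibfk_1.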