My query runs in 60-90 seconds, I would like to see how can I improve it.
One table contain partitions and have tens of million rows each partition.
The other table contain 5-8 million rows between two siteId
.
Basically the join gives the actual ID
by mLongId
. I'm wondering if I can remove the Using temporary; Using filesort
or use more columns in the used_key_parts
step.
dt
is datetime
but we save there only date, meaning where r.dt = '2021-01-01'
returns everything for that day.
@Rick James, Good catch about the function. I will convert tMovies
into utf8mb4
.
Update: after table rebuild and new unique index (see below) – plan looks different and SQL runs at 10-39s (used to be 60-80s) – I'm putting the new plan at the end of this Q. Would be nice to hear farther tuning.
tMovies: UNIQUE index ux_covering (mLongId,siteId,movieId,catalogId,uId)
Select:
select t.siteId,
r.dt,
t.movieId as movieId,
t.catalogId as catalogId,
t.uId as uId,
r.sysId,
sum(r.views) as views
from raw r
inner join tMovies t on r.mLongId = t.mLongId and t.movieId is not null and r.siteId = t.siteId
where r.dt = '2021-01-01' and r.siteId=2
group by t.siteId, t.movieId, t.catalogId, t.uId, r.dt, r.sysId
raw : KEY `ix_composite` (`dt`,`siteId`,`mLongId`,`sysId`,`views`)
tMovies : KEY `ix_composite` (`siteId`,`movieId`,`mLongId`,`catalogId`,`uId`)
+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | t | NULL | range | ix_composite | 10 | NULL | 10098009 | 100.00 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | r | p20210101 | ref | ix_composite | 1028 | const,const,func | 1 | 100.00 | Using where; Using index |
+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+
and
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "9501888.33"
},
"grouping_operation": {
"using_temporary_table": true,
"using_filesort": true,
"nested_loop": [
{
"table": {
"table_name": "t",
"access_type": "range",
"possible_keys": [
"movieId",
"ix_composite"
],
"key": "ix_composite",
"used_key_parts": [
"siteId",
"movieId"
],
"key_length": "10",
"rows_examined_per_scan": 10098009,
"rows_produced_per_join": 5049004,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "336601.27",
"eval_cost": "1009800.90",
"prefix_cost": "2356203.07",
"data_read_per_join": "1G"
},
"used_columns": [
"id",
"mLongId",
"siteId",
"movieId",
"catalogId",
"uId"
],
"attached_condition": "((`st`.`t`.`siteId` = 3) and (`st`.`t`.`movieId` is not null))"
}
},
{
"table": {
"table_name": "r",
"partitions": [
"p20210101"
],
"access_type": "ref",
"possible_keys": [
"ix_composite"
],
"key": "ix_song_day_composite",
"used_key_parts": [
"dt",
"siteId",
"mLongId"
],
"key_length": "1028",
"ref": [
"const",
"const",
"func"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 8393250,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "5467035.22",
"eval_cost": "1678650.05",
"prefix_cost": "9501888.33",
"data_read_per_join": "8G"
},
"used_columns": [
"id",
"siteId",
"mLongId",
"sysId",
"views",
"dt"
],
"attached_condition": "(`st`.`r`.`mLongId` = convert(`st`.`t`.`mLongId` using utf8mb4))"
}
}
]
}
}
}
and
CREATE TABLE raw (
id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
siteId tinyint(4) NOT NULL,
mLongId varchar(255) NOT NULL COMMENT 'spotify song id (href) / apple id',
sysId int(11) DEFAULT NULL,
views bigint(20) unsigned NOT NULL,
dt datetime NOT NULL,
PRIMARY KEY (id,dt),
KEY ix_composite (dt,siteId,mLongId,sysId,views)
) CHARSET=utf8mb4
PARTITION BY RANGE COLUMNS(dt)
(PARTITION p20171201 VALUES LESS THAN ('2018-01-01'),
PARTITION p20180101 VALUES LESS THAN ('2018-02-01'),
PARTITION p20180201 VALUES LESS THAN ('2018-03-01'),
PARTITION p20180301 VALUES LESS THAN ('2018-04-01'),
PARTITION p20180401 VALUES LESS THAN ('2018-05-01'),
PARTITION p20180501 VALUES LESS THAN ('2018-06-01'),
PARTITION p20180601 VALUES LESS THAN ('2018-07-01'),
PARTITION p20180701 VALUES LESS THAN ('2018-08-01'),
PARTITION p20180801 VALUES LESS THAN ('2018-09-01'),
PARTITION p20180901 VALUES LESS THAN ('2018-10-01'),
PARTITION p20181001 VALUES LESS THAN ('2018-11-01'),
PARTITION p20181101 VALUES LESS THAN ('2018-12-01'),
PARTITION p20181201 VALUES LESS THAN ('2019-01-01'),
PARTITION p20190101 VALUES LESS THAN ('2019-02-01'),
PARTITION p20190201 VALUES LESS THAN ('2019-03-01'),
PARTITION p20190301 VALUES LESS THAN ('2019-04-01'),
PARTITION p20190401 VALUES LESS THAN ('2019-05-01'),
PARTITION p20190501 VALUES LESS THAN ('2019-06-01'),
PARTITION p20190601 VALUES LESS THAN ('2019-07-01'),
PARTITION p20190701 VALUES LESS THAN ('2019-08-01'),
PARTITION p20190801 VALUES LESS THAN ('2019-09-01'),
PARTITION p20190901 VALUES LESS THAN ('2019-10-01'),
PARTITION p20191001 VALUES LESS THAN ('2019-11-01'),
PARTITION p20191101 VALUES LESS THAN ('2019-12-01'),
PARTITION p20191201 VALUES LESS THAN ('2020-01-01'),
PARTITION p20200101 VALUES LESS THAN ('2020-02-01'),
PARTITION p20200201 VALUES LESS THAN ('2020-03-01'),
PARTITION p20200301 VALUES LESS THAN ('2020-04-01'),
PARTITION p20200401 VALUES LESS THAN ('2020-05-01'),
PARTITION p20200501 VALUES LESS THAN ('2020-06-01'),
PARTITION p20200601 VALUES LESS THAN ('2020-07-01'),
PARTITION p20200701 VALUES LESS THAN ('2020-08-01'),
PARTITION p20200801 VALUES LESS THAN ('2020-09-01'),
PARTITION p20200901 VALUES LESS THAN ('2020-10-01'),
PARTITION p20201001 VALUES LESS THAN ('2020-11-01'),
PARTITION p20201101 VALUES LESS THAN ('2020-12-01'),
PARTITION p20201201 VALUES LESS THAN ('2021-01-01'),
PARTITION p20210101 VALUES LESS THAN ('2021-02-01'),
PARTITION p20210201 VALUES LESS THAN ('2021-03-01'),
PARTITION p20210301 VALUES LESS THAN ('2021-04-01'))
CREATE TABLE tMovies (
id int(11) NOT NULL AUTO_INCREMENT,
mLongId varchar(255) DEFAULT NULL,
siteId int(11) DEFAULT NULL,
movieId int(11) DEFAULT NULL,
catalogId int(11) DEFAULT NULL,
uId int(11) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY mLongId (mLongId),
UNIQUE index ux_covering (mLongId,siteId,movieId,catalogId,uId),
KEY ix_composite (siteId,movieId,mLongId,catalogId,uId),
) CHARSET=utf8mb4
Updated plan after table rebuild with utf8mb4:
+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | r | p20210101 | ref | ix_composite | 6 | const,const | 1285520 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t | NULL | ref | ux_covering | 1028 | st.r.mLongId,const| 1 | 50.00 | Using where; Using index |
+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "1965128.98"
},
"grouping_operation": {
"using_temporary_table": true,
"using_filesort": true,
"nested_loop": [
{
"table": {
"table_name": "r",
"partitions": [
"p20210101"
],
"access_type": "ref",
"possible_keys": [
"ix_composite"
],
"key": "ix_composite",
"used_key_parts": [
"dt",
"siteId"
],
"key_length": "6",
"ref": [
"const",
"const"
],
"rows_examined_per_scan": 1285520,
"rows_produced_per_join": 1285520,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "160690.88",
"eval_cost": "257104.00",
"prefix_cost": "417794.88",
"data_read_per_join": "1G"
},
"used_columns": [
"id",
"siteId",
"mLongId",
"sysId",
"views",
"dt"
]
}
},
{
"table": {
"table_name": "t",
"access_type": "ref",
"possible_keys": [
"mLongId",
"ux_covering",
"ix_composite"
],
"key": "ux_covering",
"used_key_parts": [
"mLongId",
"siteId"
],
"key_length": "1028",
"ref": [
"st.r.mLongId",
"const"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 650006,
"filtered": "50.00",
"using_index": true,
"cost_info": {
"read_cost": "1287331.58",
"eval_cost": "130001.26",
"prefix_cost": "1965128.98",
"data_read_per_join": "654M"
},
"used_columns": [
"id",
"mLongId",
"siteId",
"movieId",
"catalogId",
"uId"
],
"attached_condition": "(`st`.`t`.`movieId` is not null)"
}
}
]
}
}
}
Best Answer
What datatype is
dt
? If it isDATETIME
, thenr.dt = '2021-01-01'
only checks for midnight. If you really wanted all day, then doIf it is datatype
DATE
then your formulation is OK. (And my formulation and index are not broken.)(My suggested index is already set up for this change.)
What is the table
PARTITIONed
on? (Please provideSHOW CREATE TABLE
for each table; there may be other issues to investigate.) I see "func" in the Explain; that sounds 'bad'; theSHOWs
should clarify what is going on."Filesort" and "temporary" are symptoms of a problem; they are not necessarily the villains. My suggestions are designed to speed up the query; I don't now whether they will eliminate them.
The order of the columns in an index is important in helping the columns be used. As soon as a "range" is hit, no further "used_key_parts" will be used.
IS NOT NULL
is a range test.More
Datatypes should be the same in both tables when
JOINing
.mLongId
can't use an index untile you get theCOLLATION to be the same. (Numeric consistency is less critical -- cf
siteId.)In general, it not efficient to start any index with the column being
PARTITIONed
on. (I am thinking ofdt
.) Partition pruning is already doing some filtering on the column, so you may as well start the index with something else.If
t.id
is not used anywhere else, get rid of it andPRIMARY KEY
, then update theUNIQUE.mLongId
to be the PK.So far, I don't see any benefit in the partitioning (versus what could be done with just indexes). Perhaps some other queries benefit from partitioning?
Looking at just this:
Since
t.mLongId
isUNIQUE
, if the Optimizer decides to start withr
, then when it moves on tot
,mLongId
will specify only one row. The test onsiteId
seems unnecessary. (Actually, all it can do is filter out the row ifsiteId
does not match.) My point is: There may be a bug in the query.