Mysql – Tuning query – reduce temporary and/or filesort use with join

join;MySQLmysql-5.7performanceperformance-tuning

My query runs in 60-90 seconds, I would like to see how can I improve it.
One table contain partitions and have tens of million rows each partition.
The other table contain 5-8 million rows between two siteId.
Basically the join gives the actual ID by mLongId. I'm wondering if I can remove the Using temporary; Using filesort or use more columns in the used_key_parts step.

dt is datetime but we save there only date, meaning where r.dt = '2021-01-01' returns everything for that day.

@Rick James, Good catch about the function. I will convert tMovies into utf8mb4.

Update: after table rebuild and new unique index (see below) – plan looks different and SQL runs at 10-39s (used to be 60-80s) – I'm putting the new plan at the end of this Q. Would be nice to hear farther tuning.

tMovies: UNIQUE index ux_covering (mLongId,siteId,movieId,catalogId,uId)

Select:

select t.siteId, 
    r.dt,  
    t.movieId as movieId, 
    t.catalogId as catalogId,
    t.uId as uId, 
    r.sysId,
    sum(r.views) as views
from raw r
inner join tMovies t on r.mLongId = t.mLongId and t.movieId is not null and r.siteId = t.siteId
where r.dt = '2021-01-01' and r.siteId=2
group by t.siteId, t.movieId, t.catalogId, t.uId, r.dt, r.sysId

raw :  KEY `ix_composite` (`dt`,`siteId`,`mLongId`,`sysId`,`views`)
tMovies : KEY `ix_composite` (`siteId`,`movieId`,`mLongId`,`catalogId`,`uId`)


+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type  | key             | key_len | ref              | rows     | filtered | Extra                                                     |
+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | t     | NULL       | range | ix_composite    | 10      | NULL             | 10098009 |   100.00 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | r     | p20210101  | ref   | ix_composite    | 1028    | const,const,func |        1 |   100.00 | Using where; Using index                                  |
+----+-------------+-------+------------+-------+-----------------+---------+------------------+----------+----------+-----------------------------------------------------------+

and

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "9501888.33"
    },
    "grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "nested_loop": [
        {
          "table": {
            "table_name": "t",
            "access_type": "range",
            "possible_keys": [
              "movieId",
              "ix_composite"
            ],
            "key": "ix_composite",
            "used_key_parts": [
              "siteId",
              "movieId"
            ],
            "key_length": "10",
            "rows_examined_per_scan": 10098009,
            "rows_produced_per_join": 5049004,
            "filtered": "100.00",
            "using_index": true,
            "cost_info": {
              "read_cost": "336601.27",
              "eval_cost": "1009800.90",
              "prefix_cost": "2356203.07",
              "data_read_per_join": "1G"
            },
            "used_columns": [
              "id",
              "mLongId",
              "siteId",
              "movieId",
              "catalogId",
              "uId"
            ],
            "attached_condition": "((`st`.`t`.`siteId` = 3) and (`st`.`t`.`movieId` is not null))"
          }
        },
        {
          "table": {
            "table_name": "r",
            "partitions": [
              "p20210101"
            ],
            "access_type": "ref",
            "possible_keys": [
              "ix_composite"
            ],
            "key": "ix_song_day_composite",
            "used_key_parts": [
              "dt",
              "siteId",
              "mLongId"
            ],
            "key_length": "1028",
            "ref": [
              "const",
              "const",
              "func"
            ],
            "rows_examined_per_scan": 1,
            "rows_produced_per_join": 8393250,
            "filtered": "100.00",
            "using_index": true,
            "cost_info": {
              "read_cost": "5467035.22",
              "eval_cost": "1678650.05",
              "prefix_cost": "9501888.33",
              "data_read_per_join": "8G"
            },
            "used_columns": [
              "id",
              "siteId",
              "mLongId",
              "sysId",
              "views",
              "dt"
            ],
            "attached_condition": "(`st`.`r`.`mLongId` = convert(`st`.`t`.`mLongId` using utf8mb4))"
          }
        }
      ]
    }
  }
}

and

CREATE TABLE raw (
  id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  siteId tinyint(4) NOT NULL,
  mLongId varchar(255) NOT NULL COMMENT 'spotify song id (href) / apple id',
  sysId int(11) DEFAULT NULL,
  views bigint(20) unsigned NOT NULL,
  dt datetime NOT NULL,
  PRIMARY KEY (id,dt),
  KEY ix_composite (dt,siteId,mLongId,sysId,views)
) CHARSET=utf8mb4
PARTITION BY RANGE  COLUMNS(dt)
(PARTITION p20171201 VALUES LESS THAN ('2018-01-01'),
 PARTITION p20180101 VALUES LESS THAN ('2018-02-01'),
 PARTITION p20180201 VALUES LESS THAN ('2018-03-01'),
 PARTITION p20180301 VALUES LESS THAN ('2018-04-01'),
 PARTITION p20180401 VALUES LESS THAN ('2018-05-01'),
 PARTITION p20180501 VALUES LESS THAN ('2018-06-01'),
 PARTITION p20180601 VALUES LESS THAN ('2018-07-01'),
 PARTITION p20180701 VALUES LESS THAN ('2018-08-01'),
 PARTITION p20180801 VALUES LESS THAN ('2018-09-01'),
 PARTITION p20180901 VALUES LESS THAN ('2018-10-01'),
 PARTITION p20181001 VALUES LESS THAN ('2018-11-01'),
 PARTITION p20181101 VALUES LESS THAN ('2018-12-01'),
 PARTITION p20181201 VALUES LESS THAN ('2019-01-01'),
 PARTITION p20190101 VALUES LESS THAN ('2019-02-01'),
 PARTITION p20190201 VALUES LESS THAN ('2019-03-01'),
 PARTITION p20190301 VALUES LESS THAN ('2019-04-01'),
 PARTITION p20190401 VALUES LESS THAN ('2019-05-01'),
 PARTITION p20190501 VALUES LESS THAN ('2019-06-01'),
 PARTITION p20190601 VALUES LESS THAN ('2019-07-01'),
 PARTITION p20190701 VALUES LESS THAN ('2019-08-01'),
 PARTITION p20190801 VALUES LESS THAN ('2019-09-01'),
 PARTITION p20190901 VALUES LESS THAN ('2019-10-01'),
 PARTITION p20191001 VALUES LESS THAN ('2019-11-01'),
 PARTITION p20191101 VALUES LESS THAN ('2019-12-01'),
 PARTITION p20191201 VALUES LESS THAN ('2020-01-01'),
 PARTITION p20200101 VALUES LESS THAN ('2020-02-01'),
 PARTITION p20200201 VALUES LESS THAN ('2020-03-01'),
 PARTITION p20200301 VALUES LESS THAN ('2020-04-01'),
 PARTITION p20200401 VALUES LESS THAN ('2020-05-01'),
 PARTITION p20200501 VALUES LESS THAN ('2020-06-01'),
 PARTITION p20200601 VALUES LESS THAN ('2020-07-01'),
 PARTITION p20200701 VALUES LESS THAN ('2020-08-01'),
 PARTITION p20200801 VALUES LESS THAN ('2020-09-01'),
 PARTITION p20200901 VALUES LESS THAN ('2020-10-01'),
 PARTITION p20201001 VALUES LESS THAN ('2020-11-01'),
 PARTITION p20201101 VALUES LESS THAN ('2020-12-01'),
 PARTITION p20201201 VALUES LESS THAN ('2021-01-01'),
 PARTITION p20210101 VALUES LESS THAN ('2021-02-01'),
 PARTITION p20210201 VALUES LESS THAN ('2021-03-01'),
 PARTITION p20210301 VALUES LESS THAN ('2021-04-01'))
 
 
CREATE TABLE tMovies (
  id int(11) NOT NULL AUTO_INCREMENT,
  mLongId varchar(255) DEFAULT NULL,
  siteId int(11) DEFAULT NULL,
  movieId int(11) DEFAULT NULL,
  catalogId int(11) DEFAULT NULL,
  uId int(11) DEFAULT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY mLongId (mLongId),
  UNIQUE index ux_covering (mLongId,siteId,movieId,catalogId,uId),
  KEY ix_composite (siteId,movieId,mLongId,catalogId,uId),
) CHARSET=utf8mb4

Updated plan after table rebuild with utf8mb4:

+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | key          | key_len | ref               | rows    | filtered | Extra                                        |
+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+
|  1 | SIMPLE      | r     | p20210101  | ref  | ix_composite | 6       | const,const       | 1285520 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | t     | NULL       | ref  | ux_covering  | 1028    | st.r.mLongId,const|       1 |    50.00 | Using where; Using index                     |
+----+-------------+-------+------------+------+--------------+---------+-------------------+---------+----------+----------------------------------------------+

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "1965128.98"
    },
    "grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "nested_loop": [
        {
          "table": {
            "table_name": "r",
            "partitions": [
              "p20210101"
            ],
            "access_type": "ref",
            "possible_keys": [
              "ix_composite"
            ],
            "key": "ix_composite",
            "used_key_parts": [
              "dt",
              "siteId"
            ],
            "key_length": "6",
            "ref": [
              "const",
              "const"
            ],
            "rows_examined_per_scan": 1285520,
            "rows_produced_per_join": 1285520,
            "filtered": "100.00",
            "using_index": true,
            "cost_info": {
              "read_cost": "160690.88",
              "eval_cost": "257104.00",
              "prefix_cost": "417794.88",
              "data_read_per_join": "1G"
            },
            "used_columns": [
              "id",
              "siteId",
              "mLongId",
              "sysId",
              "views",
              "dt"
            ]
          }
        },
        {
          "table": {
            "table_name": "t",
            "access_type": "ref",
            "possible_keys": [
              "mLongId",
              "ux_covering",
              "ix_composite"
            ],
            "key": "ux_covering",
            "used_key_parts": [
              "mLongId",
              "siteId"
            ],
            "key_length": "1028",
            "ref": [
              "st.r.mLongId",
              "const"
            ],
            "rows_examined_per_scan": 1,
            "rows_produced_per_join": 650006,
            "filtered": "50.00",
            "using_index": true,
            "cost_info": {
              "read_cost": "1287331.58",
              "eval_cost": "130001.26",
              "prefix_cost": "1965128.98",
              "data_read_per_join": "654M"
            },
            "used_columns": [
              "id",
              "mLongId",
              "siteId",
              "movieId",
              "catalogId",
              "uId"
            ],
            "attached_condition": "(`st`.`t`.`movieId` is not null)"
          }
        }
      ]
    }
  }
}

Best Answer

r: INDEX(siteId, dt)
t: INDEX(siteId, mLongId, movieId)

What datatype is dt? If it is DATETIME, then r.dt = '2021-01-01' only checks for midnight. If you really wanted all day, then do

WHERE ...
  AND r.dt >= '2021-01-01'
  AND r.dt  < '2021-01-01' + INTERVAL 1 DAY

If it is datatype DATE then your formulation is OK. (And my formulation and index are not broken.)

(My suggested index is already set up for this change.)

What is the table PARTITIONed on? (Please provide SHOW CREATE TABLE for each table; there may be other issues to investigate.) I see "func" in the Explain; that sounds 'bad'; the SHOWs should clarify what is going on.

"Filesort" and "temporary" are symptoms of a problem; they are not necessarily the villains. My suggestions are designed to speed up the query; I don't now whether they will eliminate them.

The order of the columns in an index is important in helping the columns be used. As soon as a "range" is hit, no further "used_key_parts" will be used. IS NOT NULL is a range test.

More

Datatypes should be the same in both tables when JOINing. mLongId can't use an index untile you get the COLLATION to be the same. (Numeric consistency is less critical -- cf siteId.)

In general, it not efficient to start any index with the column being PARTITIONed on. (I am thinking of dt.) Partition pruning is already doing some filtering on the column, so you may as well start the index with something else.

If t.id is not used anywhere else, get rid of it and PRIMARY KEY, then update the UNIQUE.mLongId to be the PK.

So far, I don't see any benefit in the partitioning (versus what could be done with just indexes). Perhaps some other queries benefit from partitioning?

Looking at just this:

ON   r.mLongId = t.mLongId
and  r.siteId = t.siteId

Since t.mLongId is UNIQUE, if the Optimizer decides to start with r, then when it moves on to t, mLongId will specify only one row. The test on siteId seems unnecessary. (Actually, all it can do is filter out the row if siteId does not match.) My point is: There may be a bug in the query.