MySQL – inner join on PK with extra criteria slow despite indices

innodb; join; MySQL; performance; query-performance

Given the two tables below, I am struggling to understand:

  • why the third query is slow even though the first two queries are fast
  • what exactly EXPLAIN is saying
  • whether I can do anything to significantly speed up the slow query

Joining the two tables on the PK is fast:

mysql> select sql_no_cache p.id, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
       limit 1;
+----+---------------+
| id | postProcessed |
+----+---------------+
|  1 |             1 |
+----+---------------+
1 row in set (0.10 sec)

So is just selecting product_views rows by timestamp range:

mysql> select sql_no_cache p.id, p.timestamp 
       from product_views p 
       where p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;
+-----------+---------------------+
| id        | timestamp           |
+-----------+---------------------+
| 501719231 | 2012-10-10 00:01:03 |
+-----------+---------------------+
1 row in set (0.56 sec)

But joining the two is really slow (takes 5+ minutes):

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

Here's the EXPLAIN:

mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type   | possible_keys                                              | key                | key_len | ref                 | rows      | Extra                    |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
|  1 | SIMPLE      | p     | index  | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8       | NULL                | 119195469 | Using where; Using index |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY                                                    | PRIMARY            | 8       | clabs.p.siteVisitId |         1 |                          |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
2 rows in set (0.10 sec)

Questions

  • I was expecting the last query to run roughly as quickly as the first two added together: 1) identify a product_view within the given timestamp range, then 2) do a constant-time lookup on the matching site_visits row. There are < 95M rows in product_views within that timestamp range, so I'm not sure why 120M are being scanned…
  • The EXPLAIN above seems to say that 'timestamp_idx' wasn't used. Why not? (I guess mysqld is doing a full partition scan for product_views matching by timestamp; see the EXPLAIN PARTITIONS sketch after this list.)
  • I tried adding a (siteVisitId, timestamp) index to cover all the columns used in the WHERE clause, but that's not getting used either. Why?
  • What can I do to speed things up?
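
One way to check the full-partition-scan guess: MySQL 5.5 supports EXPLAIN PARTITIONS, which adds a partitions column showing which partitions a query will actually touch. A minimal sketch (this diagnostic is not from the original post; the output depends on your data):

mysql> explain partitions 
       select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

If pruning is working, the partitions column should list only the partitions covering the October/November 2012 range, regardless of which index is chosen.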

Notes on our db:

  • both tables are 100M+ rows
  • every product_view has exactly one siteVisit (the FK was removed to accommodate InnoDB partitioning constraints)
  • using mysql 5.5
  • no other traffic against db server

TABLES

mysql> show create table site_visits\G
*************************** 1. row ***************************
       Table: site_visits
Create Table: CREATE TABLE `site_visits` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `postProcessed` tinyint(1) NOT NULL,
  `siteVisitState` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `post_processed_idx` (`postProcessed`),
  KEY `visit_state_idx` (`siteVisitState`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1


mysql> show create table product_views\G
*************************** 1. row ***************************
       Table: product_views
Create Table: CREATE TABLE `product_views` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `timestamp` datetime NOT NULL,
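  /* additional columns truncated; the keys below reference `siteVisitId` and `rebateSearchZipCode` */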
  PRIMARY KEY (`id`,`timestamp`),
  KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
  KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
  KEY `siteVisitId_timestamp_idx` (`siteVisitId`,`timestamp`),
  KEY `timestamp_idx` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE  COLUMNS(`timestamp`)
(PARTITION p0 VALUES LESS THAN ('2012-05-01') ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN ('2012-06-01') ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN ('2012-08-01') ENGINE = InnoDB,
/* partition declarations truncated */
 PARTITION p33 VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */

Best Answer

The optimizer does not see that your conditions are correlated and picks the wrong access method.

Basically, it considers two options:

  1. Scan the index on siteVisitId until it finds a row that both matches site_visits and satisfies the timestamp condition.

  2. Scan the index on timestamp until the first match on site_visits.

Since timestamp is part of the primary key (which InnoDB appends to every secondary index) and siteVisitId is not, the first plan is covering, while the second would involve table lookups on product_views to fetch siteVisitId, which is several times slower than a pure index scan (note Using index in the plan).

The optimizer calculates the conditional probability of the timestamp condition being satisfied (given that a corresponding site_visit record exists) and compares it to the overhead of the table access.

Since your timestamp condition is quite wide (as estimated from the index statistics), the optimizer prefers the first method.
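
In 5.5 those estimates come from index cardinality statistics rather than true histograms; you can inspect what the optimizer works from with SHOW INDEX (a diagnostic sketch, not part of the original answer):

mysql> show index from product_views;
mysql> show index from site_visits;

The Cardinality column is the estimated number of distinct values per index, which feeds the row-count estimates in EXPLAIN.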

However, since both siteVisitId and timestamp are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.

In simple words, the scan has to filter through many low siteVisitId values before it finds the first row with a matching timestamp, which is exactly what is happening to your query.
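
A quick way to test this explanation (an experiment, assuming the access path really is the problem) is to force the timestamp index and see whether the first matching row comes back quickly:

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p force index (timestamp_idx), site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

This plan still pays a table lookup per row to fetch siteVisitId, but only for rows that are already inside the timestamp range.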

You should add ORDER BY timestamp to your query: that makes the timestamp index look cheaper to the optimizer, since reading the index in order avoids a sort. It would also help to create an index on (timestamp, siteVisitId), in this order, so the timestamp plan avoids table lookups entirely.
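
Put together, the fix might look like the sketch below. The index name is made up here, and note that on a partitioned InnoDB table in 5.5 ADD INDEX rebuilds the table, so expect it to take a while on 100M+ rows:

mysql> alter table product_views 
       add index timestamp_siteVisitId_idx (`timestamp`, `siteVisitId`);

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       order by p.timestamp 
       limit 1;

With that index the timestamp plan becomes covering (InnoDB appends the primary key columns, so p.id comes for free), and the ORDER BY lets the optimizer read the index in timestamp order with no sort.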