Given the two tables below I am struggling to understand:
- why is the third query slow even though first two queries are fast
- what exactly is EXPLAIN saying
- can I do anything to significantly speed up the slow query?
joining two tables on PK is fast:
mysql> select sql_no_cache p.id, sv.postProcessed
from product_views p, site_visits sv
where p.siteVisitId=sv.id
limit 1;
+----+---------------+
| id | postProcessed |
+----+---------------+
| 1 | 1 |
+----+---------------+
1 row in set (0.10 sec)
so is just selecting PVs by timestamp range:
mysql> select sql_no_cache p.id, p.timestamp
from product_views p
where p.timestamp >= "2012-10-10"
and p.timestamp < "2012-11-10"
limit 1;
+-----------+---------------------+
| id | timestamp |
+-----------+---------------------+
| 501719231 | 2012-10-10 00:01:03 |
+-----------+---------------------+
1 row in set (0.56 sec)
but joining the two is really slow (takes 5 min+ ):
mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed
from product_views p, site_visits sv
where p.siteVisitId=sv.id
and p.timestamp >= "2012-10-10"
and p.timestamp < "2012-11-10"
limit 1;
here's the EXPLAIN
mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| 1 | SIMPLE | p | index | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8 | NULL | 119195469 | Using where; Using index |
| 1 | SIMPLE | sv | eq_ref | PRIMARY | PRIMARY | 8 | clabs.p.siteVisitId | 1 | |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
2 rows in set (0.10 sec)
Questions
- i was expecting the last query to run approx as quickly as the first 2 added together: 1) identify a product_view within given timestamp and 2) do a constant lookup on matching site_visit row. There are < 95m rows in product_views within that timestamp range, not sure why 120M are being scanned…
- the explain above seems to say that 'timestamp_idx' wasn't used. why not? (I guess mysqld is doing a full partition scan for product_views matching by timestamp)
- i tried adding a (siteVisitId, timestamp) index to cover all properties used in 'WHERE' – but that's not getting used either. Why?
- what can I do to speed things up?
Notes on our db:
- both tables are 100M+ rows
- every product_view has an exactly one siteVisit. (FK was removed to accomodate InnoDB partitioning constraints)
- using mysql 5.5
- no other traffic against db server
TABLES
mysql> show create table site_visits\G
*************************** 1. row ***************************
Table: site_visits
Create Table: CREATE TABLE `site_visits` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`postProcessed` tinyint(1) NOT NULL,
`siteVisitState` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `post_processed_idx` (`postProcessed`),
KEY `visit_state_idx` (`siteVisitState`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1
mysql> show create table product_views\G
*************************** 1. row ***************************
Table: product_views
Create Table: CREATE TABLE `product_views` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`timestamp` datetime NOT NULL,
PRIMARY KEY (`id`,`timestamp`),
KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
KEY `siteVisitId_timestamp_idx` (`siteVisitId`,`timestamp`),
KEY `timestamp_idx` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE COLUMNS(`timestamp`)
(PARTITION p0 VALUES LESS THAN ('2012-05-01') ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN ('2012-06-01') ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN ('2012-08-01') ENGINE = InnoDB,
/* partition declarations truncated */
PARTITION p33 VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */
Best Answer
The optimizer does not see that your conditions are correlated and picks the wrong access method.
Basically, it considers two options:
Scan the index on
siteVisitId
until the first match onsite_visits
and the first satisfiedtimestamp
condition.Scan the index on
timestamp
until the first match onsite_visits
.Since
timestamp
is a part of the primary key andsiteVisitId
is not, the second plan would involve table lookups onproduct_views
which is several times more slow than a pure index scan (noteUsing index
in the plan).The optimizer calculates the conditional probability of the
timestamp
condition being satisfied (given that a correspondingsite_visit
record exists) and compares it to the overhead of the table access.Since your
timestamp
condition is quite wide (as seen on the index histograms), the optimizer prefers the first method.However, since both
siteVisitId
andtimestamp
are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.In simple words, you have to filter through many low
siteVisitId
until you find the first matchingtimestamp
, which is exactly what is happening to your query.You should add
ORDER BY timestamp
to your query to make thetimestamp
index cheaper as it won't have to sort. It would also help to create an index ontimestamp, siteVisitId
(in this order) to avoid table lookups.