MySQL – inner join on PK with extra criteria slow despite indices

innodb; join; MySQL; performance; query-performance

Given the two tables below, I am struggling to understand:

  • why the third query is slow even though the first two queries are fast
  • what exactly EXPLAIN is saying
  • whether I can do anything to significantly speed up the slow query

Joining the two tables on the PK is fast:

mysql> select sql_no_cache p.id, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
       limit 1;
+----+---------------+
| id | postProcessed |
+----+---------------+
|  1 |             1 |
+----+---------------+
1 row in set (0.10 sec)

So is just selecting product_views rows by timestamp range:

mysql> select sql_no_cache p.id, p.timestamp 
       from product_views p 
       where p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;
+-----------+---------------------+
| id        | timestamp           |
+-----------+---------------------+
| 501719231 | 2012-10-10 00:01:03 |
+-----------+---------------------+
1 row in set (0.56 sec)

But joining the two is really slow (takes 5+ minutes):

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

Here's the EXPLAIN:

mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type   | possible_keys                                              | key                | key_len | ref                 | rows      | Extra                    |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
|  1 | SIMPLE      | p     | index  | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8       | NULL                | 119195469 | Using where; Using index |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY                                                    | PRIMARY            | 8       | clabs.p.siteVisitId |         1 |                          |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
2 rows in set (0.10 sec)

Questions

  • I was expecting the last query to run roughly as quickly as the first two added together: 1) identify a product_view within the given timestamp range, then 2) do a constant-time lookup on the matching site_visits row. There are < 95M rows in product_views within that timestamp range, so I'm not sure why 120M are being scanned…
  • The EXPLAIN above seems to say that 'timestamp_idx' wasn't used. Why not? (I guess mysqld is doing a full partition scan for product_views matching by timestamp; see the EXPLAIN PARTITIONS sketch after this list.)
  • I tried adding a (siteVisitId, timestamp) index to cover all the columns used in the WHERE clause, but that's not getting used either. Why?
  • What can I do to speed things up?
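
One way to check the full-partition-scan guess: MySQL 5.5 supports EXPLAIN PARTITIONS, which adds a partitions column showing which partitions a query will actually touch. A minimal sketch (this diagnostic is not from the original post; the output depends on your data):

mysql> explain partitions 
       select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

If pruning is working, the partitions column should list only the partitions covering the October/November 2012 range, regardless of which index is chosen.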

Notes on our db:

  • both tables are 100M+ rows
  • every product_view has exactly one siteVisit (the FK was removed to accommodate InnoDB partitioning constraints)
  • using mysql 5.5
  • no other traffic against db server

TABLES

mysql> show create table site_visits\G
*************************** 1. row ***************************
       Table: site_visits
Create Table: CREATE TABLE `site_visits` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `postProcessed` tinyint(1) NOT NULL,
  `siteVisitState` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `post_processed_idx` (`postProcessed`),
  KEY `visit_state_idx` (`siteVisitState`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1


mysql> show create table product_views\G
*************************** 1. row ***************************
       Table: product_views
Create Table: CREATE TABLE `product_views` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `timestamp` datetime NOT NULL,
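  /* additional columns truncated; the keys below reference `siteVisitId` and `rebateSearchZipCode` */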
  PRIMARY KEY (`id`,`timestamp`),
  KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
  KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
  KEY `siteVisitId_timestamp_idx` (`siteVisitId`,`timestamp`),
  KEY `timestamp_idx` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE  COLUMNS(`timestamp`)
(PARTITION p0 VALUES LESS THAN ('2012-05-01') ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN ('2012-06-01') ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN ('2012-08-01') ENGINE = InnoDB,
/* partition declarations truncated */
 PARTITION p33 VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */

Best Answer

The optimizer does not see that your conditions are correlated and picks the wrong access method.

Basically, it considers two options:

  1. Scan the index on siteVisitId until it finds a row that both matches site_visits and satisfies the timestamp condition.

  2. Scan the index on timestamp until the first match on site_visits.

Since timestamp is part of the primary key (which InnoDB appends to every secondary index) and siteVisitId is not, the first plan is covering, while the second would involve table lookups on product_views to fetch siteVisitId, which is several times slower than a pure index scan (note Using index in the plan).

The optimizer calculates the conditional probability of the timestamp condition being satisfied (given that a corresponding site_visit record exists) and compares it to the overhead of the table access.

Since your timestamp condition is quite wide (as estimated from the index statistics), the optimizer prefers the first method.
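
In 5.5 those estimates come from index cardinality statistics rather than true histograms; you can inspect what the optimizer works from with SHOW INDEX (a diagnostic sketch, not part of the original answer):

mysql> show index from product_views;
mysql> show index from site_visits;

The Cardinality column is the estimated number of distinct values per index, which feeds the row-count estimates in EXPLAIN.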

However, since both siteVisitId and timestamp are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.

In simple words, the scan has to filter through many low siteVisitId values before it finds the first row with a matching timestamp, which is exactly what is happening to your query.
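
A quick way to test this explanation (an experiment, assuming the access path really is the problem) is to force the timestamp index and see whether the first matching row comes back quickly:

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p force index (timestamp_idx), site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       limit 1;

This plan still pays a table lookup per row to fetch siteVisitId, but only for rows that are already inside the timestamp range.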

You should add ORDER BY timestamp to your query: that makes the timestamp index look cheaper to the optimizer, since reading the index in order avoids a sort. It would also help to create an index on (timestamp, siteVisitId), in this order, so the timestamp plan avoids table lookups entirely.
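
Put together, the fix might look like the sketch below. The index name is made up here, and note that on a partitioned InnoDB table in 5.5 ADD INDEX rebuilds the table, so expect it to take a while on 100M+ rows:

mysql> alter table product_views 
       add index timestamp_siteVisitId_idx (`timestamp`, `siteVisitId`);

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
         and p.timestamp >= "2012-10-10" 
         and p.timestamp < "2012-11-10" 
       order by p.timestamp 
       limit 1;

With that index the timestamp plan becomes covering (InnoDB appends the primary key columns, so p.id comes for free), and the ORDER BY lets the optimizer read the index in timestamp order with no sort.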