MySQL – Simplify and optimize a complex query

index, index-tuning, MySQL, performance, query-performance

Our site has a chronological event feed that uses MySQL. We have noticed that some queries run for quite a bit longer than others. For example, some queries for our heaviest users take 250 ms, while some of the queries for users who don't have that many events take 30 seconds.

CREATE TABLE `events` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `enactor_id` int(11) unsigned NOT NULL,
  `enactor_type` varchar(20) NOT NULL DEFAULT '',
  `subject_id` int(11) unsigned NOT NULL,
  `subject_type` varchar(20) NOT NULL DEFAULT '',
  `result_id` int(11) unsigned NOT NULL,
  `result_type` varchar(20) NOT NULL DEFAULT '',
  `date_updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `date_created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  KEY `enactor_id` (`enactor_id`,`enactor_type`),
  KEY `subject_id` (`subject_id`,`subject_type`),
  KEY `result_id` (`result_id`,`result_type`),
  KEY `date_created` (`date_created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 

Here is a query that showed up in our slow logs (it ran for 35 seconds):

SELECT `events`.`id`
FROM `events`
WHERE (
(
    `subject_id` in (235799, 23987, 294828746, 234897224, 23429847) AND
    `enactor_type` in ('User') AND
    `subject_type` in ('Product') AND
    `result_type` in ('Comment')
) OR (
    `result_id` in (1503) AND
    `enactor_type` in ('User') AND
    `result_type` in ('Search')
) OR (
    `subject_id` in (3523) AND
    `subject_type` in ('User')
))
AND `events`.`date_created` <=  '2012-12-13 19:44:48'
ORDER BY `events`.`id` desc
LIMIT 100;

EXPLAIN shows the optimizer doing a full scan of the PRIMARY key in id order (type "index") rather than using any of the secondary indexes:

id  select_type table   type     possible_keys                      key      key_len  ref   rows  Extra
1   SIMPLE      events  index    subject_id,result_id,date_created  PRIMARY  4        NULL  200   Using where

As a reference, this query completes in 70 ms:

SELECT `events`.`id`
FROM `events`
WHERE (
(
    `subject_id` in (1234, 23876234, 234234, 232, 2342, 23424234, 2, 456456, 567, 56756756, 567567) AND
    `enactor_type` in ('User') AND
    `subject_type` in ('Product') AND
    `result_type` in ('Comment')
) OR (
    `enactor_id` in (879,486,11154) AND
    `result_type` != "Search" AND
    `result_type` != "Recommendation" AND
    !((`result_type` = "Product" AND `subject_type` = "Store")) AND
    `enactor_type` in ('User')
) OR (
    `subject_id` in (29) AND
    `enactor_type` in ('User') AND
    `subject_type` in ('Store') AND
    `result_type` in ('Product')
) OR (
    `subject_id` in (285) AND
    `subject_type` in ('User')
))
AND `events`.`date_created` <=  '2012-12-13 22:12:47'
ORDER BY `events`.`id` desc
LIMIT 100;

The events table has ~30 million rows. The database server has 15 GB of memory, and the events table is ~1 GB. I have to imagine that some sort of indexing will cure this, but I have experimented with different index choices to no avail.
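One experiment I can run (just a sketch; index hints are a diagnostic, not a fix, and whether MySQL actually picks an index-merge plan here would need to be verified with EXPLAIN) is forcing the secondary indexes on the slow query and comparing timings against the PRIMARY-key scan the optimizer chose:

-- Diagnostic only: steer the optimizer away from the full PRIMARY-key scan
-- and let it consider the existing secondary indexes instead.
SELECT `events`.`id`
FROM `events` FORCE INDEX (`subject_id`, `result_id`)
WHERE (
    (`subject_id` IN (235799, 23987, 294828746, 234897224, 23429847)
     AND `enactor_type` = 'User' AND `subject_type` = 'Product'
     AND `result_type` = 'Comment')
    OR (`result_id` = 1503 AND `enactor_type` = 'User' AND `result_type` = 'Search')
    OR (`subject_id` = 3523 AND `subject_type` = 'User')
)
AND `events`.`date_created` <= '2012-12-13 19:44:48'
ORDER BY `events`.`id` DESC
LIMIT 100;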

Edit #1:

@Erwin Smout – You bring up a good point. For users without very many matching rows, the DB has to go through all 30 million rows before it can tell there aren't 100 matches. That would explain why heavy users get faster queries than new or light users. The question, then, is how to speed up the worst-case scenario, right?

@Pieter B, the indexes are listed in the create table definition above.

@ssmusoke, our site has a following model like many "social" sites. The events table is basically a chronological log of actions taken on the site, and that query pulls the events of different types that are relevant to a user based on who that user follows. I will look into changing the `*_type` columns to tinyints to see what kind of performance gains that yields.
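For the record, the shape I have in mind is a small lookup table plus tinyint columns on events; the table name, column names, and numeric codes below are illustrative, not part of our current schema:

-- Hypothetical lookup table mapping type names to 1-byte codes.
CREATE TABLE `event_types` (
  `id`   tinyint unsigned NOT NULL,
  `name` varchar(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB;

-- e.g. (1,'User'), (2,'Product'), (3,'Comment'), (4,'Search'), ...
-- events would then store enactor_type_id / subject_type_id / result_type_id
-- as tinyints, shrinking both the rows and every index that includes a type.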

Best Answer

What does that "LIMIT 100" do in a query? What might its effect be for "heavy users" (who have lots of matching rows in the database), and what might it be for "non-heavy users" (who do not)? How many rows will the system need to inspect before the limit of 100 is reached for a heavy user, and how many before it is reached for a non-heavy user? And how many rows will it have to inspect if a user doesn't even have 100 matching rows in the database?
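One way to test this (a sketch using the predicate from your 35-second query): count the matching rows without the ORDER BY/LIMIT. If the count comes back well under 100, the scan down the PRIMARY key had no way to stop early and had to visit essentially the whole table.

-- Same predicate as the 35-second query, but no ORDER BY / LIMIT:
SELECT COUNT(*)
FROM `events`
WHERE (
    (`subject_id` IN (235799, 23987, 294828746, 234897224, 23429847)
     AND `enactor_type` = 'User' AND `subject_type` = 'Product'
     AND `result_type` = 'Comment')
    OR (`result_id` = 1503 AND `enactor_type` = 'User' AND `result_type` = 'Search')
    OR (`subject_id` = 3523 AND `subject_type` = 'User')
)
AND `date_created` <= '2012-12-13 19:44:48';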

"The question, then, is how to speed up the worst case scenario, right?"

The answer to that is that you either have to make more "intelligent" estimates than your hardcoded number 100, or you can try rewriting your OR predicates as UNION queries, in the hope that it's your engine's optimizer that simply fails to do that rewrite automagically (and provided that all the individual parts of those UNION queries can be served rapidly by existing indexes; if your predicates are generated dynamically, then you'll have to index virtually every column that could ever appear in your WHERE clause). (Disclaimer: I don't know which optimizations your engine does and does not do.)
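To make the UNION idea concrete, here is a sketch of how your 35-second query could be rewritten (untested; each arm gets its own ORDER BY/LIMIT so it can be served by one of your existing secondary indexes, and the outer query merges, re-sorts, and re-applies the limit):

-- Sketch only: each arm stops after its own 100 rows; UNION removes
-- duplicates across arms (UNION ALL is faster if the arms cannot overlap).
SELECT `id`
FROM (
    (SELECT `id` FROM `events`
     WHERE `subject_id` IN (235799, 23987, 294828746, 234897224, 23429847)
       AND `enactor_type` = 'User'
       AND `subject_type` = 'Product'
       AND `result_type` = 'Comment'
       AND `date_created` <= '2012-12-13 19:44:48'
     ORDER BY `id` DESC LIMIT 100)
    UNION
    (SELECT `id` FROM `events`
     WHERE `result_id` = 1503
       AND `enactor_type` = 'User'
       AND `result_type` = 'Search'
       AND `date_created` <= '2012-12-13 19:44:48'
     ORDER BY `id` DESC LIMIT 100)
    UNION
    (SELECT `id` FROM `events`
     WHERE `subject_id` = 3523
       AND `subject_type` = 'User'
       AND `date_created` <= '2012-12-13 19:44:48'
     ORDER BY `id` DESC LIMIT 100)
) AS merged
ORDER BY `id` DESC
LIMIT 100;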