Mysql – Optimize a query with two range conditions

MySQLoptimizationperformancequery-performancesubquery

I have a structure similar to this one:

CREATE TABLE `author` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `name` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

CREATE TABLE `book` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `author_id` int(11) unsigned NOT NULL,
  `org` int(11) unsigned NOT NULL,
  `country` char(3) NOT NULL,
  `publish_date` date NOT NULL,
  `price` decimal(6,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `author_id` (`author_id`),
  KEY `publish_date` (`publish_date`),
  KEY `i0` (`country`, `org`, `author_id`, `price`, `publish_date`),
  KEY `i1` (`country`, `org`, `author_id`, `publish_date`, `price`)
) ENGINE=InnoDB CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

Here is a SQL Fiddle

So I'm trying to execute a query with two range conditions

SELECT
  id as author_id,
  (SELECT COUNT(DISTINCT `book`.`id`)+1 
   FROM `book` 
   WHERE 
      `book`.`org` = 1
      AND `book`.`country` = 'USA' 
      AND `book`.`publish_date` BETWEEN '2010-04-30' AND '2011-04-30'
      AND `book`.`author_id` = `author`.`id`
      AND `book`.`price` < 50
  ) AS `books_under_fifty`
FROM `author` 
ORDER BY books_under_fifty desc;

but the optimizer uses only part of my index: const,const,db_9_6349e2.author.id from i0

is there a way to optimize it?

Best Answer

Well, let's think out of the box. It seems that there are two sets of results -- the authors with some such books, and those without. The first case is more efficiently done via:

SELECT  author_id,
        COUNT(*)+1 AS books_under_fifty
    FROM  `book`
    WHERE  `org` = 1
      AND  `country` = 'USA'
      AND  publish_date >= '2010-05-01'
      AND  publish_date <  '2010-05-01' + INTERVAL 1 YEAR
      AND  `price` < 50 

That would probably be best served by

INDEX(country, org, publish_date, price, author_id)

If you are happy with the performance of that, then you can think out of the box in finding "the rest of the authors".

Back to the 2-ranges problem. That can sometimes be solved by Partitioning. Not knowing the distribution of the data (is one year a small subset of the total? or what about price?), I can't say which would be better:

PARTITION BY RANGE(TO_DAYS(publish_date))

Partitioning by price is out, since DECIMAL can't be used. Storing price as number of cents and using PARTITION BY RANGE(cents) would work, but clumsily.

Then make 20-50 partitions of the partition key. This would give you "partition pruning" for one 'range', but then the index needs to be tweaked for the other. Assuming you partition by publish_date:

INDEX(country, org, price)
PRIMARY KEY(id, publish_date)

Meanwhile, ...

Don't use utf8 if you are using standard country_codes; use ascii, at least for that column.

What is org does it distinguish the book? Or are there some missing columns? Is there no other "unique" column(s)?

Another variant that might be better:

 PRIMARY KEY(country, org, price,  -- for clustering
             publish_date,   -- because partitioning requires it
             id)           -- to assure uniqueness
 INDEX(id)

This variant assumes you have no 'natural' PK. And it will be more efficient because it clusters the desired rows together. I did not include author_id in the PK since it is readily available in the row.

add index

ADD INDEX ac(author_id, country) tricked it into using i0. Go figure. Note: I left (author_id) intact and first in the list of indexes; so it is not just the order of the indexes.

I'm using Percona's 5.6.22-71.0-log .

Note that there are only 4 different country values, evenly distributed. But there are many different author_id, making it by itself more "selective".