Mysql – Optimizing an AVG / GROUP BY query

group byMySQL

Here is the query:

SELECT p.domain_id AS comp, AVG(IFNULL(p.position, 50)) as AVG, p.date AS period
    FROM userdomain ud
    LEFT JOIN userdomainexpression ude
        ON ude.userdomain_id=ud.id
    LEFT JOIN expression e
        ON e.id=ude.expression_id
    LEFT JOIN request r
        ON r.expression_id=e.id
    LEFT JOIN position p
        ON p.request_id=r.id
    WHERE ud.id=4
        AND p.domain_id IN (3,8,13,67,6,5,15,17,7,10,9,12,16,1,2,4,11,14)
        AND p.date >= '2014-11-01' AND p.date <= '2015-03-23'
    GROUP BY period, p.domain_id

At the moment it takes about 20 seconds to process. position has about 6 millions rows. Others less than a few thousands.

Here is the EXPLAIN:

id? select_type?    table?  partitions? type?   possible_keys?  key?    key_len?    ref?    rows?   Extra?
1   SIMPLE  ud  NULL    const   PRIMARY PRIMARY 4   const   1   Using index; Using temporary; Using filesort
1   SIMPLE  ude NULL    ref userdomain_id_expression_id,userdomain_id,expression_id userdomain_id_expression_id 5   const   1202    Using where; Using index
1   SIMPLE  e   NULL    eq_ref  PRIMARY PRIMARY 4   hugemetrics_prod.ude.expression_id  1   Using where; Using index
1   SIMPLE  r   NULL    ref PRIMARY,expression_id   expression_id   5   hugemetrics_prod.e.id   1   Using where; Using index
1   SIMPLE  p   NULL    ref request_id_domain_id_date,request_id,domain_id,date request_id_domain_id_date   5   hugemetrics_prod.r.id   462 Using where

When I remove the select AVG, it drops to 3/4 seconds (which is much better but still not great).

EXPLAIN tells me it uses another index for the last line:

id? select_type?    table?  partitions? type?   possible_keys?  key?    key_len?    ref?    rows?   Extra?
1   SIMPLE  ud  NULL    const   PRIMARY PRIMARY 4   const   1   Using index; Using temporary; Using filesort
1   SIMPLE  ude NULL    ref userdomain_id_expression_id,userdomain_id,expression_id userdomain_id_expression_id 5   const   1202    Using where; Using index
1   SIMPLE  e   NULL    eq_ref  PRIMARY PRIMARY 4   hugemetrics_prod.ude.expression_id  1   Using where; Using index
1   SIMPLE  r   NULL    ref PRIMARY,expression_id   expression_id   5   hugemetrics_prod.e.id   1   Using where; Using index
1   SIMPLE  p   NULL    ref request_id_domain_id_date,request_id,domain_id,date request_id_domain_id_date   5   hugemetrics_prod.r.id   462 Using where; Using index

I have indexes on ud.id, e.id, ude.expression_id, r.expression_id, e.id, p.position, p.request_id, r.id, p.domain and p.date.

I'm not sure why avg() prevents the use of index and increases so much the duration, and how to optimize the query to reach <1sec?


As ypercube suggested, the query is now:

SELECT 
      p.domain_id AS comp, 
      AVG(IFNULL(p.position, 50)) AS avg, 
      p.date AS period
FROM 
      userdomainexpression AS ude
    JOIN request r
        ON r.expression_id = ude.expression_id
    JOIN position p
        ON p.request_id = r.id
WHERE 
      ude.userdomain_id = 4
  AND p.domain_id IN (3,8,13,67,6,5,15,17,7,10,9,12,16,1,2,4,11,14)
  AND p.date >= '2014-11-01' AND p.date <= '2015-03-23'
GROUP BY 
      p.date, p.domain_id ;

I think the timing is a bit better but still very slow anyway.

The EXPLAIN:

id? select_type?    table?  partitions? type?   possible_keys?  key?    key_len?    ref?    rows?   Extra?
1   SIMPLE  ude NULL    ref userdomain_id_expression_id,userdomain_id,expression_id userdomain_id_expression_id 5   const   1202    Using where; Using index; Using temporary; Using filesort
1   SIMPLE  r   NULL    ref PRIMARY,expression_id   expression_id   5   hugemetrics_test.ude.expression_id  1   Using where; Using index
1   SIMPLE  p   NULL    ref request_id_domain_id_date,request_id,domain_id,date,date_domain_id_position request_id_domain_id_date   5   hugemetrics_test.r.id   977 Using where

And create table queries:

-- Adminer 4.1.0 MySQL dump

SET NAMES utf8;
SET time_zone = '+00:00';
SET foreign_key_checks = 0;
SET sql_mode = 'NO_AUTO_VALUE_ON_ZERO';

DROP TABLE IF EXISTS `position`;
CREATE TABLE `position` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `date` date DEFAULT NULL,
  `position` int(11) DEFAULT NULL,
  `oneboxPosition` int(11) DEFAULT NULL,
  `page` int(11) DEFAULT NULL,
  `request_id` int(11) DEFAULT NULL,
  `serp_id` int(11) DEFAULT NULL,
  `serp_type` varchar(50) DEFAULT NULL,
  `domain_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `request_id_domain_id_date` (`request_id`,`domain_id`,`date`),
  KEY `request_id` (`request_id`),
  KEY `domain_id` (`domain_id`),
  KEY `date` (`date`),
  KEY `serp_type` (`serp_type`),
  KEY `serp_id` (`serp_id`),
  KEY `oneboxPosition` (`oneboxPosition`),
  KEY `position` (`position`),
  KEY `date_domain_id_position` (`date`,`domain_id`,`position`),
  CONSTRAINT `position_ibfk_1` FOREIGN KEY (`request_id`) REFERENCES `request` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


DROP TABLE IF EXISTS `request`;
CREATE TABLE `request` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `engine` varchar(255) DEFAULT NULL,
  `region` varchar(255) DEFAULT NULL,
  `language` varchar(255) DEFAULT NULL,
  `datacenter` varchar(255) DEFAULT NULL,
  `expression_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `engine_region_language_expression_id` (`engine`,`region`,`language`,`expression_id`),
  KEY `expression_id` (`expression_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


DROP TABLE IF EXISTS `userdomainexpression`;
CREATE TABLE `userdomainexpression` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `userdomain_id` int(11) DEFAULT NULL,
  `expression_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `userdomain_id_expression_id` (`userdomain_id`,`expression_id`),
  KEY `userdomain_id` (`userdomain_id`),
  KEY `expression_id` (`expression_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


-- 2015-03-24 12:40:26

Best Answer

  • A composite index on position (date, domain_id, position) and on userdomainexpression (userdomain_id, expression_id) would help I think.

  • The join to userdomain is not needed.

  • Assuming foreign keys exist, the join to expression could be removed as well.

  • The LEFT joins are actually (equivalent to) INNER joins (due to the WHERE conditions.

  • You could try to rewrite, using GROUP BY only in positions and then join (It might be better or worse, worth a try).

  • Other possible rewrites, using EXISTS or a subquery (with DISTINCT) for the join to userdomainexpression. This might be useful if there is no unique constraint on userdomainexpression (userdomain_id, expression_id).

A suggestion, removing 2 tables from the joins:

SELECT 
      p.domain_id AS comp, 
      AVG(IFNULL(p.position, 50)) AS avg, 
      p.date AS period
FROM 
      userdomainexpression AS ude
    JOIN request r
        ON r.expression_id = ude.expression_id
    JOIN position p
        ON p.request_id = r.id
WHERE 
      ude.userdomain_id = 4
  AND p.domain_id IN (3,8,13,67,6,5,15,17,7,10,9,12,16,1,2,4,11,14)
  AND p.date >= '2014-11-01' AND p.date <= '2015-03-23'
GROUP BY 
      p.date, p.domain_id ;