MySQL – Query optimization: abnormal number of rows scanned

Tags: index, mysql, optimization

Here is a table with a few million rows (13701956 rows, 588 MB of data):

CREATE TABLE `countdata_681` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `data` int(11) DEFAULT NULL,
  `date_time` datetime DEFAULT NULL,
  `gmt_offset` smallint(6) NOT NULL DEFAULT '0' COMMENT 'GMT offset in minutes',
  `measurement_id` int(11) NOT NULL,
  `flags` bit(16) NOT NULL DEFAULT b'0' COMMENT 'bit word: b1000=validated, b10000000=deleted',
  PRIMARY KEY (`id`),
  UNIQUE KEY `ak_countdata_idx` (`measurement_id`,`date_time`,`gmt_offset`),
  CONSTRAINT `fk_countdata_measurement_681` FOREIGN KEY (`measurement_id`)
        REFERENCES `eco`.`measurement` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14352166 DEFAULT CHARSET=utf8 COMMENT='count data'

In this table, I store traffic data for about 90 different measurement_id values and try to find MAX(date_time) and MIN(date_time) for every measurement_id.
So I use the query below:

SELECT D.`measurement_id`,
       MAX(D.date_time) AS maxdate,
       MIN(D.date_time) AS mindate
FROM data_us.countdata_681 D
GROUP BY  D.`measurement_id`

Using the index ak_countdata_idx, I expected to scan 90 rows for MAX and 90 rows for MIN, so about 180 rows.
An EXPLAIN of this query gives me the following (28548 rows scanned):

    id  select_type  table   type    possible_keys     key               key_len  ref     rows   Extra
------  -----------  ------  ------  ----------------  ----------------  -------  ------  -----  --------------------------
     1  SIMPLE       D       range   ak_countdata_idx  ak_countdata_idx  10       (NULL)  28548  Using index for group-by

Why this result? Do you have any suggestions to optimize this query?

EDIT:
Here is the second table I update with this query; the SELECT takes 0.05 s, but the UPDATE can sometimes take 30 s.

UPDATE eco.`measurement` M
JOIN (SELECT D.`measurement_id`,
             MAX(D.date_time) AS maxdate,
             MIN(D.date_time) AS mindate
      FROM data_au.countdata_641 D
      GROUP BY D.`measurement_id`) t
  ON M.`id` = t.measurement_id
SET M.`last` = t.maxdate,
    M.`begin` = IF(M.`begin` < t.mindate, M.`begin`, t.mindate)

and the table (120993 rows):

 CREATE TABLE `measurement` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `flow_id` int(11) DEFAULT NULL COMMENT 'NULL if not yet attached to a site',
  `detector_id` int(11) DEFAULT NULL COMMENT 'reminder: detector = externalsystem or ecosystem or Aladdin user',
  `file_id` int(11) DEFAULT NULL COMMENT 'not NULL if imported from a file',
  `begin` datetime DEFAULT NULL,
  `end` datetime DEFAULT NULL COMMENT 'NULL if active measurement of the detector_id',
  `last` datetime DEFAULT NULL COMMENT 'last datetime in countdata (updated by visiosync)',
  `domain_id` int(11) NOT NULL COMMENT 'xxxsystem.domain_id for this measurement (needed to avoid the risk of losing the countdata table)',
  `timezone_id` int(11) NOT NULL COMMENT 'domain.timezone_id or flow.site.timezone_id for this measurement (set at attachment)',
  `timebase` int(11) NOT NULL DEFAULT '0' COMMENT 'sampling period in seconds (updated by visiosync), 0=undetermined',
  `priority` int(11) NOT NULL DEFAULT '0' COMMENT 'ranking of the measurements',
  `edited` timestamp NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date of last data modification (except additions after measurement.last)',
  `lastmodified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `fk_measurement_flow_idx` (`flow_id`),
  KEY `fk_measurement_detector_idx` (`detector_id`),
  KEY `fk_measurement_domain_idx` (`domain_id`),
  KEY `fk_measurement_timezone_idx` (`timezone_id`),
  KEY `measurement_begin_idx` (`begin`),
  KEY `measurement_last_idx` (`last`),
  KEY `measurement_end_idx` (`end`),
  CONSTRAINT `fk_measurement_detector` FOREIGN KEY (`detector_id`) REFERENCES `detector` (`id`),
  CONSTRAINT `fk_measurement_domain` FOREIGN KEY (`domain_id`) REFERENCES `domain` (`id`),
  CONSTRAINT `fk_measurement_flow` FOREIGN KEY (`flow_id`) REFERENCES `flow` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `fk_measurement_timezone` FOREIGN KEY (`timezone_id`) REFERENCES `timezone` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3109075 DEFAULT CHARSET=utf8 COMMENT='measurement'

EDIT 2:

EXPLAIN UPDATE eco.`measurement` M
JOIN (SELECT D.`measurement_id`,
             MAX(D.date_time) AS maxdate,
             MIN(D.date_time) AS mindate
      FROM data_us.countdata_681 D
      GROUP BY D.`measurement_id`) t
  ON M.`id` = t.measurement_id
SET M.`last` = t.maxdate,
    M.`begin` = IF(M.`begin` < t.mindate, M.`begin`, t.mindate);

    id  select_type  table       type    possible_keys     key               key_len  ref                 rows  Extra                     
------  -----------  ----------  ------  ----------------  ----------------  -------  ----------------  ------  --------------------------
     1  PRIMARY      <derived2>  ALL     (NULL)            (NULL)            (NULL)   (NULL)             28514  (NULL)                    
     1  PRIMARY      M           eq_ref  PRIMARY           PRIMARY           4        t.measurement_id       1  (NULL)                    
     2  DERIVED      D           range   ak_countdata_idx  ak_countdata_idx  10       (NULL)             28514  Using index for group-by 

EDIT 3:
It seems to be an application issue. I open one transaction that creates the table and then runs ALTER TABLE to add my index. At some point we moved the ALTER TABLE to after the data insertion to limit INSERT time, but with 13M rows the ALTER TABLE performs very poorly. If I create my table with the index from the start, will that solve my issue?
After that, a multi-threaded script with a connection pool populates my countdata table (in the end, does this script really populate the index file?).
At the end of the script, I resume my initial transaction and execute my UPDATE, which does a full scan.
Could it be that the optimizer's statistics have not been computed yet, or that the ALTER keeps a temporary table until my transaction commits?

Thanks in advance for your help

Best Answer

Question: does the query run in an acceptable time? If so, improving this query is non-trivial, so it's up to you whether to pursue it further.

Moving on

The number you are seeing is the estimated number of distinct values that MySQL will find for

(`measurement_id`,`date_time`,`gmt_offset`)
or
(`measurement_id`,`date_time`)

given that you already mentioned there are 90 different measurement_id values.

This is called selectivity. In EXPLAIN output, it is not the real number of rows effectively read; it is an estimate that comes from the optimizer and the statistics collected so far on those columns/indexes. In MySQL, you must look at the session performance data to get the real numbers for the work effectively done after running the query.
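For example, a quick way to see those real numbers for this query is to check the session handler counters (a sketch; FLUSH STATUS resets the session counters first):

FLUSH STATUS;
SELECT D.`measurement_id`,
       MAX(D.date_time) AS maxdate,
       MIN(D.date_time) AS mindate
FROM data_us.countdata_681 D
GROUP BY D.`measurement_id`;
-- rows actually touched, as opposed to the EXPLAIN estimate:
SHOW SESSION STATUS LIKE 'Handler_read%';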

Could you please run the following commands?

select count(*) from 
  (select `measurement_id`,`date_time`,`gmt_offset`
   from data_us.countdata_681
   group by 
   `measurement_id`,`date_time`,`gmt_offset`) as abc

and

select count(*) from 
  (select `measurement_id`,`date_time`
   from data_us.countdata_681
   group by 
   `measurement_id`,`date_time`) as abc

The output of one of these should come close to, or match, that number.

Why the number is larger also comes down to how indexes are stored and traversed, especially B-trees.

MySQL can do reverse scans for MAX as far as I know, but I'm unsure whether it can traverse the index without touching the other historical rows and leaf rows in the index at all. Remember that both tables and indexes are linked lists (a B-tree index is a linked list too).

Did you try it just for MAX? What is the result? Does running just for MAX reach the time you need?

If so, you may split this up into two queries. Asking for both ends could be forcing MySQL to scan everything. From the documentation below, I understand that it should be able to handle this, but it never hurts to test. https://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
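For example, the split could look like this (same table and columns as in the question):

-- MAX only: candidate for a backward index scan per group
SELECT D.`measurement_id`, MAX(D.date_time) AS maxdate
FROM data_us.countdata_681 D
GROUP BY D.`measurement_id`;

-- MIN only: forward index scan per group
SELECT D.`measurement_id`, MIN(D.date_time) AS mindate
FROM data_us.countdata_681 D
GROUP BY D.`measurement_id`;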

My point of view when designing tables is that you don't mix current data with historical data.

Given the amount of data you already have, I can tell this is already in production.

If this query's speed is of the utmost importance to you, I could suggest:

1) Create the following table, which you will need to update for each row inserted into the original table.

CREATE TABLE `summ_data` (
  `measurement_id` int(11) NOT NULL,
  `first_date_time` datetime NOT NULL,
  `last_date_time` datetime NOT NULL,
  PRIMARY KEY (`measurement_id`))

or

CREATE TABLE `summ_data` (
  `measurement_id` int(11) NOT NULL,
  `first_id` int(11) NOT NULL,
  `last_id` int(11) NOT NULL,
  PRIMARY KEY (`measurement_id`))
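Either variant can be populated from the existing data in one pass; a sketch for the datetime variant, using the question's table:

INSERT INTO summ_data (measurement_id, first_date_time, last_date_time)
SELECT measurement_id, MIN(date_time), MAX(date_time)
FROM data_us.countdata_681
WHERE date_time IS NOT NULL  -- date_time is nullable in countdata_681
GROUP BY measurement_id;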

or

2) Create indexes with only the columns you need. There is already a good covering index (the alternate key), so the index below is redundant, but it is more performant for this very specific query:

`idx_measure_date` (`measurement_id`,`date_time`)

EDIT

I am supposing that you need this information as fast as possible, given the answers provided.

1) Do create the index below, as already recommended. It would be nice to have an idea of how much time it shaves off the summarization.

idx_measure_date (measurement_id,date_time)

The index above may be removed after testing if you proceed with the steps below.
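For reference, a sketch of adding it on the question's table (and dropping it again later if you adopt the approach below):

ALTER TABLE data_us.countdata_681
  ADD INDEX `idx_measure_date` (`measurement_id`, `date_time`);

ALTER TABLE data_us.countdata_681
  DROP INDEX `idx_measure_date`;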

2) Although not the best solution in my point of view, you could add UPDATE and INSERT triggers to the countdata_681 table to perform the last (max) date update using the PKs; a sketch of such a trigger follows the list below. Given that you will be handling mostly pinpointed updates, it should be fast. While pursuing this route, I strongly recommend the following:

  • This is a summarized online table. I do not recommend using referential integrity on it at all. Given that you can recreate the data in 30 seconds, it's not something I would bloat with more code and handling.

  • If this information is only needed on stats panels, I recommend using the MEMORY storage engine. The upside is that the table will be as fast as possible; the downside is that it will need to be recreated and populated at every database restart/crash. Check it out.

  • If your application expects no more than 100 measurement_ids, partition this table using list partitioning, one partition for each measurement_id. This will split the effort of updating the data for each measurement, improving concurrency (see the partitioning sketch after this list). https://dev.mysql.com/doc/refman/5.7/en/partitioning-list.html

  • Do you really need this info updated in less than 3 minutes? Business-wise, is it acceptable to tolerate a delay on this info?
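Here is the trigger sketch promised above, assuming the summ_data table (datetime variant) exists; the trigger name is hypothetical, and it assumes NEW.date_time is not NULL:

DELIMITER //
CREATE TRIGGER trg_countdata_681_ai
AFTER INSERT ON countdata_681
FOR EACH ROW
BEGIN
  -- Upsert: widen the [first, last] window for this measurement_id
  INSERT INTO summ_data (measurement_id, first_date_time, last_date_time)
  VALUES (NEW.measurement_id, NEW.date_time, NEW.date_time)
  ON DUPLICATE KEY UPDATE
    first_date_time = LEAST(first_date_time, NEW.date_time),
    last_date_time  = GREATEST(last_date_time, NEW.date_time);
END//
DELIMITER ;

And a sketch of the list-partitioned variant of summ_data; the partition names and id values are examples only, one per expected measurement_id:

CREATE TABLE `summ_data` (
  `measurement_id` int(11) NOT NULL,
  `first_date_time` datetime NOT NULL,
  `last_date_time` datetime NOT NULL,
  PRIMARY KEY (`measurement_id`))
PARTITION BY LIST (`measurement_id`) (
  PARTITION p1 VALUES IN (1),
  PARTITION p2 VALUES IN (2)
  -- ...one partition per expected measurement_id
);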