How to Optimize MySQL for Low Cardinality When Majority of Rows Are Needed

Tags: mysql, mysql-5.6

I currently have a table with 3.2 million records; index cardinality is around 1,000.

It's data from Google Search Console, so roughly 5,000 records per day across roughly 5,000 distinct queries, and I need to run reports by report date rather than by query. If I run a report for 1-2 months, indexes are used (about 17% of total rows seems to be the cutoff). The data for each query needs to be aggregated and then compared so I can sort by averages of the metrics (impressions, clicks, ranking, etc.).

Partitioning the table by range cut the query time in half, but if I need to pull in 50+% of the rows, my queries take 30+ seconds.

I've considered making a summary table (and I'm starting to think this is the way to go if I need to keep it in a MySQL db) that stores a month's worth of data per query in a single row. So if I need 2017-10-15 -> 2018-10-15, I would query the summary table for 2017-11 -> 2018-09 and the main table for the partial months: 2017-10-15 -> 2017-10-31 union 2018-10-01 -> 2018-10-15.
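A monthly summary table along those lines might look like the following sketch (the table name `a_monthly` and its columns are illustrative, not from an existing schema; note that average position has to be stored as a sum plus a count, since averages of averages are wrong):

```sql
-- One row per (queryId, month), pre-aggregated from the detail table `a`
CREATE TABLE a_monthly (
  yearMonth   MEDIUMINT UNSIGNED NOT NULL,  -- e.g. 201711
  queryId     INT UNSIGNED NOT NULL,
  impressions BIGINT NOT NULL,
  clicks      BIGINT NOT NULL,
  positionSum DOUBLE NOT NULL,              -- SUM(position) over rows with position > 0
  positionCnt INT NOT NULL,                 -- COUNT of rows with position > 0
  PRIMARY KEY (yearMonth, queryId)
) ENGINE=InnoDB;

-- Populate one whole month at a time from the detail table
INSERT INTO a_monthly
SELECT 201711, queryId,
       SUM(impressions), SUM(clicks),
       SUM(position), SUM(position > 0)
FROM a
WHERE reportDate >= '2017-11-01' AND reportDate < '2017-12-01'
GROUP BY queryId;
```

The report query then sums whole months from `a_monthly` and only the partial boundary months from `a`, and computes `avgPosition` as `SUM(positionSum) / SUM(positionCnt)`.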

This table is joined to one other table just to get "favorite" queries; other than that it isn't joined to anything, so dropping the relational aspect may be worth it if it means an increase in speed.

More than anything I'm curious whether there's something in MySQL I'm not aware of that I could be utilizing, though I imagine there isn't. Unfortunately I'm not well versed in databases, so I'm also not aware of potentially better options among the available technologies; I'm definitely open to expanding my knowledge in this realm.

CREATE TABLE `a` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `reportDate` date NOT NULL,
  `clicks` int(11) NOT NULL COMMENT 'metric',
  `impressions` int(11) NOT NULL COMMENT 'metric',
  `position` double(10,2) NOT NULL COMMENT 'metric',
  `query` varchar(300) NOT NULL COMMENT 'dimension',
  `queryId` int(11) unsigned NOT NULL,
  `page` text NOT NULL COMMENT 'dimension',
  PRIMARY KEY (`id`),
  KEY `query` (`reportDate`,`queryId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

explain:

explain SELECT SQL_NO_CACHE a.query                 AS  keyword, 
       Coalesce(Sum(impressions), 0)                AS  impressions, 
       Coalesce(Sum(clicks), 0)                     AS  clicks, 
       Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2)    
                                                    AS  clickThroughRate, 
       Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2)
                                                    AS  avgPosition, 
       Coalesce(b.starred, 0)                       AS  starred 
FROM   a
       LEFT JOIN b 
              ON b.queryid = a.queryid 
                 AND b.query = a.query 
WHERE  a.reportDate >= '2018-05-01'
       AND a.reportDate <= '2018-10-25'
GROUP  BY a.query 
ORDER  BY impressions DESC 
LIMIT  0, 30;

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  a   ALL reportdate  NULL    NULL    NULL    3252967 Using where; Using temporary; Using filesort

Partitioned test table:

CREATE TABLE `a` (
  `reportdate` date NOT NULL,
  `clicks` int(11) NOT NULL COMMENT 'metric',
  `impressions` int(11) NOT NULL COMMENT 'metric',
  `position` double(10,2) NOT NULL COMMENT 'metric',
  `query` varchar(300) NOT NULL COMMENT 'dimension',
  `queryId` int(11) unsigned NOT NULL,
  `quarteryear` mediumint(6) unsigned NOT NULL,
  KEY `reportdate` (`reportdate`,`queryId`,`quarteryear`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (`quarteryear`)
(PARTITION `start` VALUES LESS THAN (0) ENGINE = InnoDB,
 PARTITION before20160101 VALUES LESS THAN (201601) ENGINE = InnoDB,
 PARTITION before20160601 VALUES LESS THAN (201606) ENGINE = InnoDB,
 PARTITION before20170101 VALUES LESS THAN (201701) ENGINE = InnoDB,
 PARTITION before20170301 VALUES LESS THAN (201703) ENGINE = InnoDB,
 PARTITION before20170401 VALUES LESS THAN (201705) ENGINE = InnoDB,
 PARTITION before20170701 VALUES LESS THAN (201707) ENGINE = InnoDB,
 PARTITION before20170901 VALUES LESS THAN (201709) ENGINE = InnoDB,
 PARTITION before20171101 VALUES LESS THAN (201711) ENGINE = InnoDB,
 PARTITION before20180101 VALUES LESS THAN (201801) ENGINE = InnoDB,
 PARTITION before20180301 VALUES LESS THAN (201803) ENGINE = InnoDB,
 PARTITION before20180401 VALUES LESS THAN (201805) ENGINE = InnoDB,
 PARTITION before20180701 VALUES LESS THAN (201807) ENGINE = InnoDB,
 PARTITION before20180901 VALUES LESS THAN (201809) ENGINE = InnoDB,
 PARTITION before20181101 VALUES LESS THAN (201811) ENGINE = InnoDB,
 PARTITION before20190101 VALUES LESS THAN (201901) ENGINE = InnoDB,
 PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;

Explain:

explain partitions SELECT SQL_NO_CACHE 
       a.query                                                            AS keyword, 
       Coalesce(Sum(impressions), 0)                                      AS impressions, 
       Coalesce(Sum(clicks), 0)                                           AS clicks, 
       Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2)    AS clickThroughRate, 
       Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2) AS avgPosition, 
       Coalesce(b.starred, 0)                                             AS starred
FROM   a
       LEFT JOIN b 
              ON b.queryid = a.queryid 
                 AND b.query = a.query 
WHERE  a.reportdate >= '2018-05-01'
       AND a.reportdate <= '2018-10-25'
       AND a.quarteryear >= 201710
       AND a.quarteryear <= 201810
GROUP  BY a.query 
ORDER  BY impressions DESC 
LIMIT  0, 30;


id  select_type table   partitions  type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  a   before20171101,before20180101,before20180301,before20180401,before20180701,before20180901,before20181101    ALL reportdate  NULL    NULL    NULL    1673926 Using where; Using temporary; Using filesort
1   SIMPLE  b   NULL    eq_ref  PRIMARY PRIMARY 56  J1001.a.queryId,J1001.a.query   1   Using where

Update: dimension table with an auto-increment id to link queries

CREATE TABLE `dataTable` (
  `id` INT(11) unsigned NOT NULL AUTO_INCREMENT,
  `reportdate` date NOT NULL,
  `clicks` mediumint(11) NOT NULL COMMENT 'metric',
  `impressions` mediumint(11) NOT NULL COMMENT 'metric',
  `position` double(10,2) NOT NULL COMMENT 'metric',
  `queryId` mediumint(8) unsigned NOT NULL,
  `Device` enum('DESKTOP','MOBILE','TABLET') NOT NULL,
  PRIMARY KEY (`reportdate`,`id`),
  index (`id`),
  FOREIGN KEY (`queryId`) REFERENCES settingsTable(`id`) -- auto-increment primary id on settingsTable
);
CREATE TABLE `settingsTable` (
  `id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `query` varchar(300) NOT NULL COMMENT 'dimension',
  `queryId` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `queryId` (`queryId`)
) ENGINE=InnoDB AUTO_INCREMENT=65536 DEFAULT CHARSET=utf8;

SELECT SQL_NO_CACHE 
      settingsTable.query                                                AS keyword, 
      Coalesce(Sum(impressions), 0)                                      AS impressions, 
      Coalesce(Sum(clicks), 0)                                           AS clicks, 
      Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2)    AS clickThroughRate, 
      Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2) AS avgPosition
FROM   dataTable 
       LEFT JOIN settingsTable ON dataTable.queryId = settingsTable.id
WHERE  dataTable.reportdate >= '2018-01-25'
      AND dataTable.reportdate <= '2018-10-25'
GROUP  BY dataTable.queryid
ORDER  BY impressions DESC 
LIMIT  0, 30;

explain:
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  dataTable   range   PRIMARY PRIMARY 3   NULL    1584326 Using where; Using temporary; Using filesort
1   SIMPLE  settingsTable   eq_ref  PRIMARY PRIMARY 3   DB.dataTable.queryId    1   NULL

Best Answer

PARTITIONing does not necessarily help performance. Certainly not in this case, since you are not "pruning" on quarteryear. You could get pruning if you changed to PARTITION BY RANGE (TO_DAYS(reportDate)) and changed the constants in the partition definitions accordingly.
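Under that scheme the existing date-range WHERE clause is enough for pruning by itself, with no extra quarteryear predicate. A sketch (table and partition names here are illustrative):

```sql
-- Repartition on the date column itself so date ranges prune
CREATE TABLE a2 (
  reportDate  date NOT NULL,
  queryId     int unsigned NOT NULL,
  clicks      int NOT NULL,
  impressions int NOT NULL,
  KEY (reportDate, queryId)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(reportDate)) (
  PARTITION p201805 VALUES LESS THAN (TO_DAYS('2018-06-01')),
  PARTITION p201806 VALUES LESS THAN (TO_DAYS('2018-07-01')),
  PARTITION pfuture VALUES LESS THAN MAXVALUE
);

-- The partitions column of this EXPLAIN should list only the matching partitions:
EXPLAIN PARTITIONS
SELECT SUM(clicks) FROM a2
WHERE reportDate >= '2018-05-01' AND reportDate <= '2018-06-15';
```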

Will you be using DROP PARTITION to efficiently remove old rows? That is one of the few reasons to ever use PARTITIONing. I discuss it here.
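For example, with your existing partition layout, purging everything before 2016 is a near-instant metadata operation rather than a large DELETE:

```sql
-- Drops every row in that partition at once; far cheaper than DELETE ... WHERE
ALTER TABLE a DROP PARTITION before20160101;
```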

Fix those, plus the dimension for query, then let's make a second pass over the issues with the query.

How many rows in each table? (I am fishing for the impact of the JOIN and GROUP BY; I may want to recommend a different approach.)

Without partitioning...

I predict that if you remove partitioning, remove quarteryear, add id back in, and change to

  PRIMARY KEY(`reportdate`, id),
  INDEX(id)

The query will run faster. (I don't know about other queries.)

Note that having queryid in the index does not help. (See my indexing cookbook.)

Cardinality...

Yes, 17% is about where the cutoff is expected; I have seen less and I have seen more. The reason is that bouncing between the index and the table is somewhat costly, so at some point it is more efficient to simply scan the table. (The optimizer is imperfect at picking the right cutoff.)

The trick is to arrange for the rows needed to be consecutive in the data; that is what putting reportdate first in the PRIMARY KEY accomplishes. That way, the cardinality issue vanishes.

However, other queries may suffer. Please provide the main queries so I can juggle the tradeoffs.

Summary Tables

But the real speedup will come from building and maintaining Summary Table(s) so that the query does not have to shovel through so many rows.

Correctness of summary tables... If you never delete, update, or insert after the fact, then it is rather straightforward to ensure that the summarization (for 'yesterday', for example) is exact. If you do need to rebuild the summary, have a script handy that does so.

Another building technique is to increment the summary as you go, usually with IODKU (INSERT ... ON DUPLICATE KEY UPDATE) to insert or increment. I recommend providing a stored procedure that does both tasks as one.
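The IODKU pattern, sketched against a hypothetical monthly summary table (`a_monthly` and its columns are illustrative): the first arrival of a (month, query) pair inserts the row, every later arrival adds to it.

```sql
-- Insert-or-increment: run this as each day's detail rows are loaded
INSERT INTO a_monthly (yearMonth, queryId, impressions, clicks, positionSum, positionCnt)
VALUES (201810, 42, 100, 7, 12.3, 1)
ON DUPLICATE KEY UPDATE
  impressions = impressions + VALUES(impressions),
  clicks      = clicks      + VALUES(clicks),
  positionSum = positionSum + VALUES(positionSum),
  positionCnt = positionCnt + VALUES(positionCnt);
```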

Query id as a hash

You were right to change direction. Keep only an INT in the current table; put the INT AUTO_INCREMENT, the query text, and the hash in another table. At that point the hash may as well be MD5, which is extremely unlikely to produce a duplicate before the computer dies of old age. Indexing any kind of hash in a huge table can be costly, so minimize how often you do it.
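A common MySQL idiom for that lookup table (the table name `queries` is illustrative): store the MD5 as BINARY(16), and use IODKU with LAST_INSERT_ID(expr) so one statement either creates the row or fetches the existing id.

```sql
CREATE TABLE queries (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  md5   BINARY(16) NOT NULL,      -- UNHEX(MD5(query)), 16 bytes instead of 32 chars
  query VARCHAR(300) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (md5)
) ENGINE=InnoDB;

-- Get-or-create the id for a query string in a single round trip
INSERT INTO queries (md5, query)
VALUES (UNHEX(MD5('some query text')), 'some query text')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID();          -- the id, whether the row was new or existing
```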

See mysqldumpslow for one way to turn SQL queries into a 'canonical' form: it replaces numbers and strings with placeholders, thereby whittling the number of 'different' queries down from millions to thousands. (I have done something similar when collecting slow logs from hundreds of servers.)