I currently have a table with 3.2 million records; index cardinality is around 1000.
It's data from Google Search Console, so ~5000 records per day for ~5000 different queries, and I need to run reports based on report date rather than query. If I run a report for 1-2 months, indexes are used (about 17% of total rows is the cutoff). The data for each query needs to be aggregated and then compared so I can sort by averages of the metrics (impressions, clicks, ranking, etc.).
Partitioning the table by range cut the query time in half, but if I need to pull in 50+% of the rows, my queries take 30+ seconds.
I've considered making a summary table (and I'm starting to think this is the way to go if I need to keep it in a MySQL DB), storing a month's worth of data per query in a single row. So if I need 2017-10-15 -> 2018-10-15, I would query the summary table for 2017-11 -> 2018-09 and then the main table for 2017-10-01 -> 2017-10-15 U 2018-10-01 -> 2018-10-15.
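A sketch of that rollup idea, with hypothetical table and column names; storing per-month sums (not averages) lets whole months be recombined with the partial-month rows from the main table:

```sql
-- Hypothetical monthly rollup: one row per (month, query).
CREATE TABLE `querySummary` (
  `yearMonth`   mediumint(6) unsigned NOT NULL,  -- e.g. 201711
  `queryId`     int(11) unsigned NOT NULL,
  `impressions` int(11) NOT NULL,
  `clicks`      int(11) NOT NULL,
  `positionSum` double NOT NULL,  -- store the SUM, not the average,
  `positionCnt` int(11) NOT NULL, -- plus the count, so months can be combined later
  PRIMARY KEY (`yearMonth`, `queryId`)
) ENGINE=InnoDB;
```

Whole months would come from this table, and only the partial months at each end of the requested range would still hit the main table.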
This table is joined to another table just to get "favorite" queries; other than that it is not joined to anything, so dropping the relational aspect may be worth it if it means an increase in speed.
More than anything I'm just curious whether there's something in MySQL I could be utilizing that I'm not aware of, but I imagine there isn't. Unfortunately I'm not well versed in databases, so I'm also not aware of potentially better options among the available technologies; I'm definitely open to expanding my knowledge in the realm of databases.
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`reportDate` date NOT NULL,
`clicks` int(11) NOT NULL COMMENT 'metric',
`impressions` int(11) NOT NULL COMMENT 'metric',
`position` double(10,2) NOT NULL COMMENT 'metric',
`query` varchar(300) NOT NULL COMMENT 'dimension',
`queryId` int(11) unsigned NOT NULL,
`page` text NOT NULL COMMENT 'dimension',
PRIMARY KEY (`id`),
KEY `query` (`reportDate`,`queryId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
explain:
explain SELECT SQL_NO_CACHE a.query AS keyword,
Coalesce(Sum(impressions), 0) AS impressions,
Coalesce(Sum(clicks), 0) AS clicks,
Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2)
AS clickThroughRate,
Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2)
AS avgPosition,
Coalesce(b.starred, 0) AS starred
FROM a
LEFT JOIN b
ON b.queryid = a.queryid
AND b.query = a.query
WHERE a.reportDate >= '2018-05-01'
AND a.reportDate <= '2018-10-25'
GROUP BY a.query
ORDER BY impressions DESC
LIMIT 0, 30;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a ALL reportdate NULL NULL NULL 3252967 Using where; Using temporary; Using filesort
Partitioned test table:
CREATE TABLE `a` (
`reportdate` date NOT NULL,
`clicks` int(11) NOT NULL COMMENT 'metric',
`impressions` int(11) NOT NULL COMMENT 'metric',
`position` double(10,2) NOT NULL COMMENT 'metric',
`query` varchar(300) NOT NULL COMMENT 'dimension',
`queryId` int(11) unsigned NOT NULL,
`quarteryear` mediumint(6) unsigned NOT NULL,
KEY `reportdate` (`reportdate`,`queryId`,`quarteryear`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (`quarteryear`)
(PARTITION `start` VALUES LESS THAN (0) ENGINE = InnoDB,
PARTITION before20160101 VALUES LESS THAN (201601) ENGINE = InnoDB,
PARTITION before20160601 VALUES LESS THAN (201606) ENGINE = InnoDB,
PARTITION before20170101 VALUES LESS THAN (201701) ENGINE = InnoDB,
PARTITION before20170301 VALUES LESS THAN (201703) ENGINE = InnoDB,
PARTITION before20170501 VALUES LESS THAN (201705) ENGINE = InnoDB,
PARTITION before20170701 VALUES LESS THAN (201707) ENGINE = InnoDB,
PARTITION before20170901 VALUES LESS THAN (201709) ENGINE = InnoDB,
PARTITION before20171101 VALUES LESS THAN (201711) ENGINE = InnoDB,
PARTITION before20180101 VALUES LESS THAN (201801) ENGINE = InnoDB,
PARTITION before20180301 VALUES LESS THAN (201803) ENGINE = InnoDB,
PARTITION before20180501 VALUES LESS THAN (201805) ENGINE = InnoDB,
PARTITION before20180701 VALUES LESS THAN (201807) ENGINE = InnoDB,
PARTITION before20180901 VALUES LESS THAN (201809) ENGINE = InnoDB,
PARTITION before20181101 VALUES LESS THAN (201811) ENGINE = InnoDB,
PARTITION before20190101 VALUES LESS THAN (201901) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
Explain:
explain partitions SELECT SQL_NO_CACHE a.query AS keyword,
       Coalesce(Sum(impressions), 0) AS impressions,
       Coalesce(Sum(clicks), 0) AS clicks,
       Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2) AS clickThroughRate,
       Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2) AS avgPosition,
       Coalesce(b.starred, 0) AS starred
FROM a
LEFT JOIN b
ON b.queryid = a.queryid
AND b.query = a.query
WHERE a.reportdate >= '2018-05-01'
AND a.reportdate <= '2018-10-25'
AND a.quarteryear <= 201810
AND a.quarteryear >= 201710
GROUP BY a.query
ORDER BY impressions DESC
LIMIT 0, 30;
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE a before20171101,before20180101,before20180301,before20180401,before20180701,before20180901,before20181101 ALL reportdate NULL NULL NULL 1673926 Using where; Using temporary; Using filesort
1 SIMPLE b NULL eq_ref PRIMARY PRIMARY 56 J1001.a.queryId,J1001.a.query 1 Using where
Update: Dimension table with an auto-increment id to link queries
CREATE TABLE `dataTable` (
`id` INT(11) unsigned NOT NULL AUTO_INCREMENT,
`reportdate` date NOT NULL,
`clicks` mediumint(11) NOT NULL COMMENT 'metric',
`impressions` mediumint(11) NOT NULL COMMENT 'metric',
`position` double(10,2) NOT NULL COMMENT 'metric',
`queryId` mediumint(8) unsigned NOT NULL,
`Device` enum('DESKTOP','MOBILE','TABLET') NOT NULL,
PRIMARY KEY (`reportdate`,`id`),
index (`id`),
FOREIGN KEY (`queryId`) REFERENCES settingsTable(`id`) -- auto-increment primary id on settingsTable
);
CREATE TABLE `settingsTable` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`query` varchar(300) NOT NULL COMMENT 'dimension',
`queryId` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `queryId` (`queryId`)
) ENGINE=InnoDB AUTO_INCREMENT=65536 DEFAULT CHARSET=utf8;
SELECT SQL_NO_CACHE settingsTable.query AS keyword,
       Coalesce(Sum(impressions), 0) AS impressions,
       Coalesce(Sum(clicks), 0) AS clicks,
       Round(Coalesce(( Sum(clicks) / Sum(impressions) ) * 100, 0), 2) AS clickThroughRate,
       Round(Coalesce(Sum(position) / Sum(IF(position = 0, 0, 1)), 0), 2) AS avgPosition
FROM dataTable
       LEFT JOIN settingsTable ON dataTable.queryId = settingsTable.id
WHERE dataTable.reportdate >= '2018-01-25'
AND dataTable.reportdate <= '2018-10-25'
GROUP BY dataTable.queryid
ORDER BY impressions DESC
LIMIT 0, 30;
explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE dataTable range PRIMARY PRIMARY 3 NULL 1584326 Using where; Using temporary; Using filesort
1 SIMPLE settingsTable eq_ref PRIMARY PRIMARY 3 DB.dataTable.queryId 1 NULL
Best Answer
PARTITIONing does not necessarily help performance. It certainly does not in this case, since you are not "pruning" on quarteryear. You could get pruning if you changed to PARTITION BY RANGE (TO_DAYS(reportDate)) and changed the constants in the partition definitions suitably.

Will you be using DROP PARTITION to efficiently remove old rows? That is one of the few reasons ever to use PARTITIONing; I discuss it here.

Fix those, plus the dimension table for query; then let's make a second pass over the issues with the query. How many rows are in each table? (I am fishing for the impact of the JOIN and GROUP BY; I may want to recommend a different approach.)

Without partitioning...
I predict that if you remove partitioning, remove quarteryear, add id back in, and change to

PRIMARY KEY (`reportdate`, `id`),
INDEX (`id`)

the query will run faster. (I don't know about other queries.)
Note that having queryId in the index does not help. (See my indexing cookbook.)

Cardinality...
Yes, 17% is about the expected cutoff. I have seen less; I have seen more. The reason is that bouncing between the index and the table is somewhat costly, so at some point it is more efficient to simply scan the table. (The optimizer is imperfect at picking the right cutoff.)
The trick here is to arrange for the needed rows to be consecutive in the data. This was done by putting reportdate first in the PRIMARY KEY. That way, the cardinality issue vanishes.

However, other queries may suffer. Please provide the main queries so I can juggle the tradeoffs.
Summary Tables
But the real speedup will come from building and maintaining Summary Table(s) so that the query does not have to shovel through so many rows.
Correctness of summary tables... If you never delete, update, or insert after the fact, then it is rather straightforward to ensure that the summarization (for 'yesterday', for example) is exact. If you do need to rebuild the summary, have a script handy that does so.
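Such a rebuild script can simply recompute a slice from the base table; a sketch, against a hypothetical querySummary table keyed on (summaryDate, queryId):

```sql
-- Recompute one day's summary rows from scratch (names are illustrative).
REPLACE INTO querySummary (summaryDate, queryId, impressions, clicks, positionSum, positionCnt)
SELECT reportDate, queryId,
       SUM(impressions), SUM(clicks),
       SUM(position), SUM(IF(position = 0, 0, 1))
FROM   dataTable
WHERE  reportDate = '2018-10-24'
GROUP  BY reportDate, queryId;
```

REPLACE deletes any existing summary rows for that key and inserts fresh ones, so the script is safe to rerun.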
Another building technique is to increment the summary as you go (usually with IODKU, i.e., INSERT ... ON DUPLICATE KEY UPDATE). I recommend providing a stored procedure that does both tasks as one.
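A sketch of that increment-as-you-go step, against a hypothetical querySummary table keyed on (summaryDate, queryId); the literals stand in for one incoming row's values:

```sql
-- Insert a new summary row, or fold the new metrics into an existing one.
INSERT INTO querySummary (summaryDate, queryId, impressions, clicks, positionSum, positionCnt)
VALUES ('2018-10-25', 123, 40, 3, 7.50, 1)
ON DUPLICATE KEY UPDATE
  impressions = impressions + VALUES(impressions),
  clicks      = clicks      + VALUES(clicks),
  positionSum = positionSum + VALUES(positionSum),
  positionCnt = positionCnt + VALUES(positionCnt);
```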
Query id as a hash
You were right to change direction. Have only an INT in the current table; put the INT AUTO_INCREMENT, the query, and the hash in another table. At that point, the hash may as well be MD5, which is extremely unlikely to produce a duplicate before the computer dies of old age. Indexing any kind of hash in a huge table can be costly, so minimize how often you do it.

See mysqlslow for one way to turn SQL queries into a 'canonical' form. It replaces numbers and strings with ?, thereby whittling down the number of 'different' queries from millions to thousands. (I have done a similar thing when collecting slowlogs from hundreds of servers.)