MySQL – How to optimize a query with multiple GROUP BY columns

amazon-rds, MySQL, performance, query-performance

I have a web application where users can view performance data. There will be around 200 to 250 users, and each user will generate roughly 100,000 rows of data per day.

There is only one table, and here is its structure:

CREATE TABLE `performance` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `OfferId` varchar(128) COLLATE utf8_unicode_ci NOT NULL,
  `title` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
  `CampaignName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `Impressions` int(11) NOT NULL,
  `Clicks` int(11) NOT NULL,
  `Ctr` double(10,2) NOT NULL,
  `AverageCpc` int(11) NOT NULL,
  `Cost` int(11) NOT NULL,
  `ConvertedClicks` int(11) NOT NULL,
  `Conversions` int(11) NOT NULL,
  `CostPerConvertedClick` int(11) NOT NULL,
  `ClickConversionRate` double(10,2) NOT NULL,
  `ConversionValue` int(11) NOT NULL,
  `Roi` double(10,2) NOT NULL,
  `AdGroupName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `Brand` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Price` double(10,2) NOT NULL,
  `L1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L5` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute0` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Date` date NOT NULL,
  `DataStatus` tinyint(1) NOT NULL DEFAULT '0',
  `valueChanged` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `attribute_changed` int(11) DEFAULT NULL ,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`,`user_id`),
  KEY `performance_offerid_index` (`OfferId`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50100 PARTITION BY HASH (user_id)
PARTITIONS 256 */;

The volume of data in this table will reach billions of rows and will keep increasing.

The GROUP BY clause can include any of the following columns, and the combination is effectively random because it is controlled from the front-end. For example, if a user wants to see the Brand column on the front-end, then the query will group by Brand; otherwise it won't.

OfferId will always be part of the GROUP BY clause.

Group by columns list

OfferId
CampaignName
AdGroupName
Brand
Attribute0
Attribute1
Attribute2
Attribute3
Attribute4
L1
L2
L3
L4
L5
Title

The aggregate functions will be applied to these columns:

Impressions
Clicks
Ctr
AverageCpc
Cost
ConvertedClicks
Conversions
CostPerConvertedClick
ClickConversionRate
ConversionValue
Roi

Every query must include a user_id condition and a Date condition. Beyond that, there can be multiple conditions ANDed together, since the front-end interface allows users to filter the data.

The queries are taking too much time. I loaded 2.5 million records for user_id 1, and when I run a query with a simple GROUP BY on 3 columns, it takes around 250 seconds, but the maximum acceptable time is 5 seconds. How can I optimize it?

The query is

SELECT `OfferId` AS `OfferId`,`Attribute0` AS `Attribute0`,
SUM(`Impressions`) AS `Impressions`,SUM(`Ctr`) AS `Ctr`,
ROUND(AVG(`AverageCpc`),2) AS `AverageCpc`,SUM(`Cost`) AS `Cost`,
SUM(`ConvertedClicks`) AS `ConvertedClicks`,SUM(`Conversions`) AS `Conversions`,
ROUND(AVG(`CostPerConvertedClick`),2) AS `CostPerConvertedClick`,
ROUND(AVG(`ClickConversionRate`),2) AS `ClickConversionRate`,
ROUND(AVG(`ConversionValue`),2) AS `ConversionValue`,
ROUND(AVG(`Roi`),2) AS `Roi`,`id` AS `id` FROM `performance`
 WHERE (`user_id` = 13) 
 AND (`Date` BETWEEN '2016-01-08 00:00:00'
 AND  '2016-02-08 23:59:59')
 GROUP BY `OfferId`,`Attribute0` 
 having Cost > 100
 order by Cost
LIMIT 0,10;

I can add a covering index, but the problem is that the query is generated dynamically: sometimes certain columns will be included in the GROUP BY and sometimes not.

For example, let's say I add a covering index 'attribute_index' on

Attribute0
Attribute1
Attribute2
Attribute3
Attribute4

these columns, in this order. But there is no guarantee that all 5 columns will always be used in the query. So if only

Attribute0
Attribute4

are used in the query, the attribute_index will not be used.
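For reference, the index I am describing would be created roughly like this (just to make the example concrete):

ALTER TABLE performance
  ADD INDEX attribute_index (Attribute0, Attribute1, Attribute2, Attribute3, Attribute4);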

I am using RDS Aurora. Is there any solution to my problem?

Best Answer

Do not use DOUBLE(10,2); it involves two roundings. Use either DECIMAL(10,2) or plain DOUBLE. For monetary values (Price), use DECIMAL(...) so that there will be no rounding. DOUBLE is 8 bytes; DECIMAL(10,2) is 5 bytes.

Can you have "negative" Clicks? Suggest INT UNSIGNED.
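A minimal sketch of those two suggestions, applied to a few of the affected columns (keeping the precision at 10,2 is an assumption):

ALTER TABLE performance
  MODIFY Ctr         DECIMAL(10,2) NOT NULL,  -- exact value, no double rounding
  MODIFY Price       DECIMAL(10,2) NOT NULL,  -- monetary value: no floating-point rounding
  MODIFY Clicks      INT UNSIGNED NOT NULL,   -- clicks cannot be negative
  MODIFY Impressions INT UNSIGNED NOT NULL;   -- same reasoning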

It is usually a bad idea to splay arrays across columns (L1..L5 and Attribute0..Attribute4). Instead, have another table for them. (OK, I don't know what impact that will have on queries that GROUP BY the attributes.)
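One possible shape for such a table (the table and column names here are only illustrative):

CREATE TABLE performance_attribute (
  performance_id BIGINT NOT NULL,            -- refers to performance.id
  position       TINYINT UNSIGNED NOT NULL,  -- 0..4 instead of Attribute0..Attribute4
  value          VARCHAR(32) NOT NULL,
  PRIMARY KEY (performance_id, position)
) ENGINE=InnoDB;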

PARTITION BY HASH has no known case of improving performance.

attribute_changed int(11) DEFAULT NULL -- if that is just a flag, make it TINYINT UNSIGNED NOT NULL DEFAULT '0'; that saves 3 bytes, plus the extra space needed for NULL.
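As a sketch, assuming it really is just a flag:

ALTER TABLE performance
  MODIFY attribute_changed TINYINT UNSIGNED NOT NULL DEFAULT '0';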

Will you have a billion different Campaigns? CampaignName varchar(255) should be normalized and replaced by, say, a MEDIUMINT UNSIGNED (3 bytes) to save a lot of space. Ditto for any other varchars that repeat a lot.
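A rough sketch of that normalization (the lookup table name campaign is an assumption):

CREATE TABLE campaign (
  campaign_id  MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  CampaignName VARCHAR(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (campaign_id),
  UNIQUE KEY (CampaignName)
) ENGINE=InnoDB;

-- performance would then carry campaign_id MEDIUMINT UNSIGNED (3 bytes)
-- instead of CampaignName varchar(255)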

If you are expecting billions of rows, squeezing out a few bytes per row adds up. That, in turn, will decrease the I/O, thereby speeding up queries.

Your covering index on the 5 attributes could consume a huge amount of space. Furthermore, with a billion rows, it may slow down INSERTs to one row per disk hit! On traditional drives, that is only about 100 inserts/second, and you need more than 300/second (250 users × 100,000 rows/day is roughly 290 rows/second).

Since you say every query must include a Date clause, it may be practical to use PARTITION BY RANGE(TO_DAYS(Date)). It is unclear, but it sounds like user_id is also required in every query? In that case, I would suggest INDEX(user_id, Date) without partitioning. That composite index is much better than an index on Date alone. Adding more columns to that index will not help.
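A sketch of that suggestion (the index name is an assumption):

-- drop the HASH partitioning and rely on a composite index instead
ALTER TABLE performance REMOVE PARTITIONING;
ALTER TABLE performance
  ADD INDEX idx_user_date (user_id, `Date`);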

Your example has a Date range of one month plus one day; is that reasonable, or just a goof?

One table per client does not help.

Do not partition by user -- 250 partitions have their own performance problems. About 50 is the 'practical' limit for the number of partitions.

Summary tables (plural) are the only way you will make this multi-billion-row table perform adequately. The PRIMARY KEY of each summary table would include user_id and the date (probably truncated to the day), plus a couple of other "dimensions". Several aggregations (mostly COUNTs and SUMs) would be the other columns. These tables will be significantly smaller than your 'Fact' table, and they can afford to have multiple indexes. Beware of AVG, because the average of averages is not mathematically correct; instead store the SUM and the COUNT, then calculate the average as SUM(sums)/SUM(counts).
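A hedged sketch of one such summary table, plus how it could be maintained and queried (all names are illustrative, and which "dimensions" to include is a design choice):

CREATE TABLE performance_daily_offer (
  user_id     INT NOT NULL,
  `Date`      DATE NOT NULL,
  OfferId     VARCHAR(128) COLLATE utf8_unicode_ci NOT NULL,
  Attribute0  VARCHAR(32)  COLLATE utf8_unicode_ci NOT NULL,
  Impressions BIGINT NOT NULL,
  Clicks      BIGINT NOT NULL,
  Cost        BIGINT NOT NULL,
  CpcSum      BIGINT NOT NULL,   -- store the SUM ...
  CpcCount    INT    NOT NULL,   -- ... and the COUNT, never the AVG
  PRIMARY KEY (user_id, `Date`, OfferId, Attribute0)
) ENGINE=InnoDB;

-- Fold one day of the fact table into the summary table:
INSERT INTO performance_daily_offer
SELECT user_id, `Date`, OfferId, Attribute0,
       SUM(Impressions), SUM(Clicks), SUM(Cost),
       SUM(AverageCpc), COUNT(*)
FROM performance
WHERE `Date` = '2016-01-08'
GROUP BY user_id, `Date`, OfferId, Attribute0;

-- Reporting: the average of averages is wrong, so recompute it as SUM(sums)/SUM(counts):
SELECT OfferId, Attribute0,
       SUM(Impressions) AS Impressions,
       SUM(Cost) AS Cost,
       ROUND(SUM(CpcSum) / SUM(CpcCount), 2) AS AverageCpc
FROM performance_daily_offer
WHERE user_id = 13
  AND `Date` BETWEEN '2016-01-08' AND '2016-02-08'
GROUP BY OfferId, Attribute0;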

More on Data Warehousing and Summary Tables.

(Yes, some of my statements do disagree with previous Comments and Answers.)