Using MySQL 5.7.
We are building a click-tracking platform, where we record every click in the following table:
CREATE TABLE `click`
(
`id` INTEGER NOT NULL AUTO_INCREMENT,
`ref_id` VARCHAR(500),
`datetime` TIMESTAMP NOT NULL,
`offer_id` INTEGER NOT NULL,
`publisher_id` INTEGER,
`advertiser_id` INTEGER,
`adid` VARCHAR(50),
`app_name` VARCHAR(80),
`sub_pub` VARCHAR(225),
`custom1` VARCHAR(225),
`custom2` VARCHAR(225),
`country_id` INTEGER,
`os_id` INTEGER,
`detected_user_agent` VARCHAR(1000),
`detected_device_ip` VARCHAR(25),
`detected_referrer` VARCHAR(1000),
`raw_info` VARCHAR(255),
PRIMARY KEY (`id`),
INDEX `click_fi_273355` (`offer_id`),
INDEX `click_fi_35872e` (`publisher_id`),
INDEX `click_fi_d78542` (`advertiser_id`),
INDEX `click_fi_6a2e9f` (`country_id`),
INDEX `click_fi_e2bd35` (`os_id`),
CONSTRAINT `click_fk_273355`
FOREIGN KEY (`offer_id`)
REFERENCES `offer` (`id`),
CONSTRAINT `click_fk_35872e`
FOREIGN KEY (`publisher_id`)
REFERENCES `publisher` (`id`),
CONSTRAINT `click_fk_d78542`
FOREIGN KEY (`advertiser_id`)
REFERENCES `advertiser` (`id`),
CONSTRAINT `click_fk_6a2e9f`
FOREIGN KEY (`country_id`)
REFERENCES `country` (`id`),
CONSTRAINT `click_fk_e2bd35`
FOREIGN KEY (`os_id`)
REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB
Every time there is an install, I save all the columns from the click table into the install table, along with some extra columns.
The redundancy is deliberate: I will be deleting old granular click data every two weeks, but I need to keep the granular data for clicks that resulted in an install.
CREATE TABLE `install`
(
`click_id` INTEGER NOT NULL,
`click_time` DATETIME NOT NULL,
`datetime` DATETIME NOT NULL,
`income` FLOAT(6,2) NOT NULL,
`payout` FLOAT(6,2) NOT NULL,
`transaction_sum` FLOAT(6,2),
`type` TINYINT(1) DEFAULT 1 NOT NULL,
`ref_id` VARCHAR(500),
`offer_id` INTEGER NOT NULL,
`publisher_id` INTEGER,
`advertiser_id` INTEGER,
`adid` VARCHAR(50),
`app_name` VARCHAR(80),
`sub_pub` VARCHAR(225),
`custom1` VARCHAR(225),
`custom2` VARCHAR(225),
`country_id` INTEGER,
`os_id` INTEGER,
`detected_user_agent` VARCHAR(1000),
`detected_device_ip` VARCHAR(25),
`detected_referrer` VARCHAR(1000),
`raw_info` VARCHAR(255),
PRIMARY KEY (`click_id`),
INDEX `install_fi_273355` (`offer_id`),
INDEX `install_fi_35872e` (`publisher_id`),
INDEX `install_fi_d78542` (`advertiser_id`),
INDEX `install_fi_6a2e9f` (`country_id`),
INDEX `install_fi_e2bd35` (`os_id`),
CONSTRAINT `install_fk_273355`
FOREIGN KEY (`offer_id`)
REFERENCES `offer` (`id`),
CONSTRAINT `install_fk_35872e`
FOREIGN KEY (`publisher_id`)
REFERENCES `publisher` (`id`),
CONSTRAINT `install_fk_d78542`
FOREIGN KEY (`advertiser_id`)
REFERENCES `advertiser` (`id`),
CONSTRAINT `install_fk_6a2e9f`
FOREIGN KEY (`country_id`)
REFERENCES `country` (`id`),
CONSTRAINT `install_fk_e2bd35`
FOREIGN KEY (`os_id`)
REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB;
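The copy-on-install step described above can be sketched as a single INSERT ... SELECT; the ? placeholders stand for the install-only values (income, payout, type) and the click id, which come from the conversion event rather than from the click table:

```sql
-- Hedged sketch: copy the matching click row into `install`,
-- adding the install-only columns supplied by the conversion event.
INSERT INTO install
      (click_id, click_time, `datetime`, income, payout, `type`,
       ref_id, offer_id, publisher_id, advertiser_id, adid, app_name,
       sub_pub, custom1, custom2, country_id, os_id,
       detected_user_agent, detected_device_ip, detected_referrer, raw_info)
SELECT c.id, c.`datetime`, NOW(), ?, ?, ?,
       c.ref_id, c.offer_id, c.publisher_id, c.advertiser_id, c.adid, c.app_name,
       c.sub_pub, c.custom1, c.custom2, c.country_id, c.os_id,
       c.detected_user_agent, c.detected_device_ip, c.detected_referrer, c.raw_info
FROM click AS c
WHERE c.id = ?;
```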
1: We are getting around 40 million clicks per day, which results in 40 million new rows added to the click table each day and around 5000 rows in the install table.
2: I have to generate an aggregated report for the users in the dashboard, with filters and grouping by date and the "*_id" columns.
example 1:
SELECT DATE(click.`datetime`) AS item,
       COUNT(click.id) AS clicks,
       COUNT(install.click_id) AS installs,
       SUM(install.income) AS income,
       SUM(install.payout) AS payout
FROM click
LEFT JOIN install ON install.click_id = click.id
WHERE click.`datetime` > '2018-05-08' AND click.`datetime` < '2018-05-17'
GROUP BY item
LIMIT 25;
example 2:
SELECT click.offer_id AS item,
       COUNT(click.id) AS clicks,
       COUNT(install.click_id) AS installs,
       SUM(install.income) AS income,
       SUM(install.payout) AS payout
FROM click
LEFT JOIN install ON install.click_id = click.id
WHERE click.publisher_id IN (123, 456, 78)
  AND click.`datetime` > '2018-05-08' AND click.`datetime` < '2018-05-17'
GROUP BY item
LIMIT 25;
Queries like this against the click table are obviously (or so I think) not a good idea, so I created a summary table that pre-aggregates the data by date and hour.
The data for this table comes from the click and install tables.
CREATE TABLE `stats_main`
(
`datetime` DATETIME NOT NULL,
`offer_id` INTEGER NOT NULL,
`publisher_id` INTEGER,
`advertiser_id` INTEGER,
`country_id` INTEGER DEFAULT 1,
`os_id` INTEGER DEFAULT 1,
`sub_pub` VARCHAR(225) DEFAULT 'empty',
`clicks` INTEGER DEFAULT 0,
`approved` INTEGER DEFAULT 0,
`approved_income` FLOAT(6,2) DEFAULT 0.00,
`over_capped` INTEGER DEFAULT 0,
`over_capped_income` FLOAT(6,2) DEFAULT 0.00,
`internal` INTEGER DEFAULT 0,
`internal_income` FLOAT(6,2) DEFAULT 0.00,
`payout` FLOAT(6,2) DEFAULT 0,
INDEX `stats_main_fi_273355` (`offer_id`),
INDEX `stats_main_fi_35872e` (`publisher_id`),
INDEX `stats_main_fi_d78542` (`advertiser_id`),
INDEX `stats_main_fi_6a2e9f` (`country_id`),
INDEX `stats_main_fi_e2bd35` (`os_id`),
CONSTRAINT `stats_main_fk_273355`
FOREIGN KEY (`offer_id`)
REFERENCES `offer` (`id`),
CONSTRAINT `stats_main_fk_35872e`
FOREIGN KEY (`publisher_id`)
REFERENCES `publisher` (`id`),
CONSTRAINT `stats_main_fk_d78542`
FOREIGN KEY (`advertiser_id`)
REFERENCES `advertiser` (`id`),
CONSTRAINT `stats_main_fk_6a2e9f`
FOREIGN KEY (`country_id`)
REFERENCES `country` (`id`),
CONSTRAINT `stats_main_fk_e2bd35`
FOREIGN KEY (`os_id`)
REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB;
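The hourly load into stats_main can be sketched as an INSERT ... SELECT roll-up of the raw click rows; this assumes clicks and installs are rolled up in separate passes (the install pass would be analogous, filling the approved/over_capped/internal counters):

```sql
-- Hedged sketch: roll one hour of raw clicks into `stats_main`.
-- @hour_start marks the hour being summarized.
SET @hour_start = '2018-05-17 01:00:00';

INSERT INTO stats_main
      (`datetime`, offer_id, publisher_id, advertiser_id,
       country_id, os_id, sub_pub, clicks)
SELECT @hour_start,
       offer_id, publisher_id, advertiser_id,
       country_id, os_id, IFNULL(sub_pub, 'empty'),
       COUNT(*)
FROM click
WHERE `datetime` >= @hour_start
  AND `datetime` <  @hour_start + INTERVAL 1 HOUR
GROUP BY offer_id, publisher_id, advertiser_id,
         country_id, os_id, IFNULL(sub_pub, 'empty');
```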
With this setup there are fewer entries, but still not few enough, since new entries keep accumulating every hour for each combination of ids. For example:
datetime            | offer_id | publisher_id | country_id | os_id | s_installs
2018-05-17 01:00:00 |        1 |            1 |          1 |     1 |          3
2018-05-17 01:00:00 |        2 |            1 |          8 |     2 |          3
2018-05-17 01:00:00 |        3 |           17 |        112 |     3 |          3
2018-05-17 01:00:00 |        4 |            3 |          6 |     1 |          3
2018-05-17 01:00:00 |        5 |            2 |          1 |     2 |          3
2018-05-17 01:00:00 |     1000 |           25 |        256 |     3 |          3
........
2018-05-17 02:00:00 |        1 |            1 |          1 |     1 |          3
2018-05-17 02:00:00 |        2 |            1 |          8 |     2 |          3
2018-05-17 02:00:00 |        3 |           17 |        112 |     3 |          3
2018-05-17 02:00:00 |        4 |            3 |          6 |     1 |          3
2018-05-17 02:00:00 |        5 |            2 |          1 |     2 |          3
2018-05-17 02:00:00 |     1000 |           25 |        256 |     3 |          3
Now, getting an aggregated report from this new summary table is still slow, and the query time increases as rows are added.
With only 120000 rows covering 2 days, this query takes around 3-4 seconds, and I assume the time will keep increasing as the table grows over the coming dates.
SELECT DATE(`datetime`) AS item,
       SUM(clicks) AS clicks,
       SUM(approved+over_capped+internal) AS total_installs,
       ROUND(SUM(approved_income+over_capped_income+internal_income),2) AS total_income,
       SUM(over_capped) AS over_capped,
       ROUND(SUM(over_capped_income),2) AS over_capped_income,
       SUM(internal) AS internal,
       ROUND(SUM(internal_income),2) AS internal_income,
       ROUND(IFNULL((SUM(approved+over_capped+internal) / SUM(clicks)) * 100, 0),3) AS cr,
       ROUND(IFNULL((SUM(approved_income+over_capped_income+internal_income) / SUM(clicks)) * 100, 0),3) AS epc,
       ROUND(SUM(approved_income+internal_income),2) AS approved_income,
       ROUND(SUM(payout),2) AS payout,
       ROUND(SUM(approved_income+internal_income-payout),2) AS approved_earnings
FROM stats_main
WHERE `datetime` > '2018-05-08' AND `datetime` < '2018-05-17'
GROUP BY item
LIMIT 25;
There is a compound unique index on the stats_main table:
ALTER TABLE `stats_main` ADD UNIQUE INDEX `unique_index`(`datetime`, `offer_id`, `publisher_id`, `advertiser_id`, `country_id`, `os_id`);
My questions are:
1: Am I doing something wrong?
2: Can the structure of the tables be changed to increase query performance? If yes, how?
Best Answer
FLOAT(m,n) should never be used. Perhaps you want DECIMAL(6,2) for exact money? And is 9999.99 big enough? I suspect it is not big enough for the summary tables.

click is awfully bulky for a table getting 40M inserts per day (roughly 460/sec). Can some of those VARCHARs be normalized or eliminated? Perhaps even normalize certain clumps of them.

If you are purging 'old' (2-week-old) data, consider PARTITION BY RANGE(..) with daily partitions; then DROP PARTITION is very fast. (More discussion: http://mysql.rjweb.org/doc.php/partitionmaint )

The key test for whether a summary table summarizes enough is the ratio of rows. One-tenth as many summarized rows as raw rows is a good target.
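The partitioning suggestion might look like the sketch below. Note that in MySQL 5.7 a partitioned InnoDB table cannot have FOREIGN KEYs, and every UNIQUE key (including the PRIMARY KEY) must include the partitioning column, so click's keys would need adjusting first; this is a sketch, not a drop-in statement:

```sql
-- Hedged sketch: daily partitions on `click` (requires dropping the FKs
-- and widening the PK to include `datetime` beforehand).
ALTER TABLE click
  PARTITION BY RANGE (TO_DAYS(`datetime`)) (
    PARTITION p20180516 VALUES LESS THAN (TO_DAYS('2018-05-17')),
    PARTITION p20180517 VALUES LESS THAN (TO_DAYS('2018-05-18')),
    -- ... one partition per day, created ahead of time by a maintenance job ...
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
  );

-- Purging a day of old clicks is then nearly instant:
ALTER TABLE click DROP PARTITION p20180516;
```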
Please provide SHOW CREATE TABLE for the summary table. Based on your two samples, I would expect something like:
(I recommend picking the appropriate integer sizes everywhere; it saves a little space, thereby improving speed a little.)
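A minimal sketch of such a summary table, keyed to the two example queries; the table name, column names, integer sizes, and composite PRIMARY KEY here are all assumptions:

```sql
-- Hedged sketch: a daily summary keyed by the dimensions the two
-- example queries filter and group on.
CREATE TABLE stats_daily (           -- hypothetical name
  `day`        DATE              NOT NULL,
  publisher_id SMALLINT UNSIGNED NOT NULL,
  offer_id     MEDIUMINT UNSIGNED NOT NULL,
  clicks       INT UNSIGNED      NOT NULL,
  installs     MEDIUMINT UNSIGNED NOT NULL,
  income       DECIMAL(10,2)     NOT NULL,
  payout       DECIMAL(10,2)     NOT NULL,
  PRIMARY KEY (publisher_id, offer_id, `day`)
) ENGINE=InnoDB;
```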
A 'bug': WHERE datetime > '2018-05-08' leaves out midnight on the first day.
For a week's worth of data, I recommend this pattern:
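The recommended range pattern is presumably the usual inclusive-start / exclusive-end form, which also avoids the missing-midnight bug:

```sql
WHERE `datetime` >= '2018-05-08'
  AND `datetime` <  '2018-05-08' + INTERVAL 7 DAY
```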
(Your first example has DATE(datetime) AS item -- smells wrong.)
The second example becomes:
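Rewritten against a daily summary table (the table and column names here are assumptions), the second example might become:

```sql
SELECT offer_id AS item,
       SUM(clicks)   AS clicks,
       SUM(installs) AS installs,
       SUM(income)   AS income,
       SUM(payout)   AS payout
FROM stats_daily                 -- hypothetical daily summary table
WHERE publisher_id IN (123, 456, 78)
  AND `day` >= '2018-05-08'
  AND `day` <  '2018-05-08' + INTERVAL 9 DAY
GROUP BY offer_id
ORDER BY clicks DESC             -- a LIMIT needs a deterministic order
LIMIT 25;
```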
(Note: A LIMIT without an ORDER BY is not meaningful.)
Your sample output shows a breakdown by hour. That is possible.
You will probably need more than one Summary table. The one above handles any combination of publisher_id and/or offer_id, broken down by day.
If you break down the data by hour, you will probably need fewer other parts to the PRIMARY KEY, so as to get close to the target of '10'.
Every day (or hour), add new records to the summary tables for that day (or hour) from stats_main. Or you can use IODKU (INSERT ... ON DUPLICATE KEY UPDATE) as you go. Either avoids the "120000 and growing" problem you mentioned. (More in the link below.)
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
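The IODKU variant, maintaining the counters as each click arrives, relies on the unique index from the question; which counter columns to increment per event type is an assumption:

```sql
-- Hedged sketch: upsert one click into the summary row for its hour.
-- The unique index (datetime, offer_id, publisher_id, advertiser_id,
-- country_id, os_id) makes the ON DUPLICATE KEY clause fire.
INSERT INTO stats_main
      (`datetime`, offer_id, publisher_id, advertiser_id,
       country_id, os_id, clicks)
VALUES ('2018-05-17 01:00:00', 42, 7, 3, 1, 2, 1)
ON DUPLICATE KEY UPDATE clicks = clicks + 1;
```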
"save all the columns from click table to install table" -- I don't understand. Perhaps there is another way to do whatever the purpose is?