MySQL – Optimising a MySQL table for aggregate queries (14 columns, half a million entries added each day)

MySQL

Using MySQL 5.7.

We are building a click-tracking platform where we record every click:

CREATE TABLE `click`
(
    `id` INTEGER NOT NULL AUTO_INCREMENT,
    `ref_id` VARCHAR(500),
    `datetime` TIMESTAMP NOT NULL,
    `offer_id` INTEGER NOT NULL,
    `publisher_id` INTEGER,
    `advertiser_id` INTEGER,
    `adid` VARCHAR(50),
    `app_name` VARCHAR(80),
    `sub_pub` VARCHAR(225),
    `custom1` VARCHAR(225),
    `custom2` VARCHAR(225),
    `country_id` INTEGER,
    `os_id` INTEGER,
    `detected_user_agent` VARCHAR(1000),
    `detected_device_ip` VARCHAR(25),
    `detected_referrer` VARCHAR(1000),
    `raw_info` VARCHAR(255),
    PRIMARY KEY (`id`),
    INDEX `click_fi_273355` (`offer_id`),
    INDEX `click_fi_35872e` (`publisher_id`),
    INDEX `click_fi_d78542` (`advertiser_id`),
    INDEX `click_fi_6a2e9f` (`country_id`),
    INDEX `click_fi_e2bd35` (`os_id`),
    CONSTRAINT `click_fk_273355`
        FOREIGN KEY (`offer_id`)
        REFERENCES `offer` (`id`),
    CONSTRAINT `click_fk_35872e`
        FOREIGN KEY (`publisher_id`)
        REFERENCES `publisher` (`id`),
    CONSTRAINT `click_fk_d78542`
        FOREIGN KEY (`advertiser_id`)
        REFERENCES `advertiser` (`id`),
    CONSTRAINT `click_fk_6a2e9f`
        FOREIGN KEY (`country_id`)
        REFERENCES `country` (`id`),
    CONSTRAINT `click_fk_e2bd35`
        FOREIGN KEY (`os_id`)
        REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB

Every time there is an install, I copy all the columns from the click table into the install table, along with some extra columns. The redundancy is deliberate: I delete old granular click data every two weeks, but I need to keep the granular data for clicks that resulted in an install.
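The copy step described above can be done with a single INSERT ... SELECT. This is a sketch; the `:income`, `:payout`, and `:click_id` placeholders stand in for values supplied by the application:

```sql
-- Sketch: copy the matching click row into install, adding the
-- install-specific columns. Placeholders are illustrative.
INSERT INTO install
    (click_id, click_time, datetime, income, payout, type,
     ref_id, offer_id, publisher_id, advertiser_id, adid, app_name,
     sub_pub, custom1, custom2, country_id, os_id,
     detected_user_agent, detected_device_ip, detected_referrer, raw_info)
SELECT  c.id, c.datetime, NOW(), :income, :payout, 1,
        c.ref_id, c.offer_id, c.publisher_id, c.advertiser_id, c.adid, c.app_name,
        c.sub_pub, c.custom1, c.custom2, c.country_id, c.os_id,
        c.detected_user_agent, c.detected_device_ip, c.detected_referrer, c.raw_info
FROM    click AS c
WHERE   c.id = :click_id;
```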

CREATE TABLE `install`
(
    `click_id` INTEGER NOT NULL,
    `click_time` DATETIME NOT NULL,
    `datetime` DATETIME NOT NULL,
    `income` FLOAT(6,2) NOT NULL,
    `payout` FLOAT(6,2) NOT NULL,
    `transaction_sum` FLOAT(6,2),
    `type` TINYINT(1) DEFAULT 1 NOT NULL,
    `ref_id` VARCHAR(500),
    `offer_id` INTEGER NOT NULL,
    `publisher_id` INTEGER,
    `advertiser_id` INTEGER,
    `adid` VARCHAR(50),
    `app_name` VARCHAR(80),
    `sub_pub` VARCHAR(225),
    `custom1` VARCHAR(225),
    `custom2` VARCHAR(225),
    `country_id` INTEGER,
    `os_id` INTEGER,
    `detected_user_agent` VARCHAR(1000),
    `detected_device_ip` VARCHAR(25),
    `detected_referrer` VARCHAR(1000),
    `raw_info` VARCHAR(255),
    PRIMARY KEY (`click_id`),
    INDEX `install_fi_273355` (`offer_id`),
    INDEX `install_fi_35872e` (`publisher_id`),
    INDEX `install_fi_d78542` (`advertiser_id`),
    INDEX `install_fi_6a2e9f` (`country_id`),
    INDEX `install_fi_e2bd35` (`os_id`),
    CONSTRAINT `install_fk_273355`
        FOREIGN KEY (`offer_id`)
        REFERENCES `offer` (`id`),
    CONSTRAINT `install_fk_35872e`
        FOREIGN KEY (`publisher_id`)
        REFERENCES `publisher` (`id`),
    CONSTRAINT `install_fk_d78542`
        FOREIGN KEY (`advertiser_id`)
        REFERENCES `advertiser` (`id`),
    CONSTRAINT `install_fk_6a2e9f`
        FOREIGN KEY (`country_id`)
        REFERENCES `country` (`id`),
    CONSTRAINT `install_fk_e2bd35`
        FOREIGN KEY (`os_id`)
        REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB;

1: We are getting around 40 million clicks per day, which results in 40 million new rows added to the click table each day and around 5,000 rows in the install table.

2: I have to generate an aggregated report for users in the dashboard, with filters and grouping on the date and the "*_id" columns.

example 1:

SELECT  DATE(click.datetime) AS item,
        COUNT(click.id) AS clicks,
        COUNT(install.click_id) AS installs,
        SUM(install.income) AS income,
        SUM(install.payout) AS payout
    FROM click
    LEFT JOIN install ON install.click_id = click.id
    WHERE click.datetime > '2018-05-08' AND click.datetime < '2018-05-17'
    GROUP BY item
    LIMIT 25;

example 2:

SELECT  click.offer_id AS item,
        COUNT(click.id) AS clicks,
        COUNT(install.click_id) AS installs,
        SUM(install.income) AS income,
        SUM(install.payout) AS payout
    FROM click
    LEFT JOIN install ON install.click_id = click.id
    WHERE click.publisher_id IN (123,456,78)
      AND click.datetime > '2018-05-08' AND click.datetime < '2018-05-17'
    GROUP BY item
    LIMIT 25;

Queries like this against the click table are obviously (or so I think) a bad idea, so I created a summary table that pre-aggregates the data by date and hour.

The data for this table comes from click table and install table.

CREATE TABLE `stats_main`
(
    `datetime` DATETIME NOT NULL,
    `offer_id` INTEGER NOT NULL,
    `publisher_id` INTEGER,
    `advertiser_id` INTEGER,
    `country_id` INTEGER DEFAULT 1,
    `os_id` INTEGER DEFAULT 1,
    `sub_pub` VARCHAR(225) DEFAULT 'empty',
    `clicks` INTEGER DEFAULT 0,
    `approved` INTEGER DEFAULT 0,
    `approved_income` FLOAT(6,2) DEFAULT 0.00,
    `over_capped` INTEGER DEFAULT 0,
    `over_capped_income` FLOAT(6,2) DEFAULT 0.00,
    `internal` INTEGER DEFAULT 0,
    `internal_income` FLOAT(6,2) DEFAULT 0.00,
    `payout` FLOAT(6,2) DEFAULT 0,
    INDEX `stats_main_fi_273355` (`offer_id`),
    INDEX `stats_main_fi_35872e` (`publisher_id`),
    INDEX `stats_main_fi_d78542` (`advertiser_id`),
    INDEX `stats_main_fi_6a2e9f` (`country_id`),
    INDEX `stats_main_fi_e2bd35` (`os_id`),
    CONSTRAINT `stats_main_fk_273355`
        FOREIGN KEY (`offer_id`)
        REFERENCES `offer` (`id`),
    CONSTRAINT `stats_main_fk_35872e`
        FOREIGN KEY (`publisher_id`)
        REFERENCES `publisher` (`id`),
    CONSTRAINT `stats_main_fk_d78542`
        FOREIGN KEY (`advertiser_id`)
        REFERENCES `advertiser` (`id`),
    CONSTRAINT `stats_main_fk_6a2e9f`
        FOREIGN KEY (`country_id`)
        REFERENCES `country` (`id`),
    CONSTRAINT `stats_main_fk_e2bd35`
        FOREIGN KEY (`os_id`)
        REFERENCES `operating_system` (`id`)
) ENGINE=InnoDB;
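As a sketch of how an hourly roll-up from the click table into stats_main might look (the hour window and the NULL-to-default mapping are illustrative assumptions; the install-related counters would come from a similar query against the install table):

```sql
-- Sketch: roll up one hour of clicks into stats_main.
-- Window boundaries and column mapping are illustrative.
INSERT INTO stats_main
    (datetime, offer_id, publisher_id, advertiser_id,
     country_id, os_id, sub_pub, clicks)
SELECT  DATE_FORMAT(datetime, '%Y-%m-%d %H:00:00'),
        offer_id, publisher_id, advertiser_id,
        IFNULL(country_id, 1), IFNULL(os_id, 1), IFNULL(sub_pub, 'empty'),
        COUNT(*)
FROM    click
WHERE   datetime >= '2018-05-17 01:00:00'
  AND   datetime <  '2018-05-17 02:00:00'
GROUP BY 1, 2, 3, 4, 5, 6, 7;
```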

With this setup there are fewer rows, but still not few enough, since new rows keep accumulating every hour for each combination of ids. For example:

datetime           |offer_id |publisher_id |country_id |os_id | s_installs   
2018-05-17 01:00:00    1         1              1           1        3
2018-05-17 01:00:00    2         1              8           2        3
2018-05-17 01:00:00    3         17             112         3        3
2018-05-17 01:00:00    4         3              6           1        3
2018-05-17 01:00:00    5         2              1           2        3
2018-05-17 01:00:00    1000      25             256         3        3
........


2018-05-17 02:00:00    1         1              1           1        3
2018-05-17 02:00:00    2         1              8           2        3
2018-05-17 02:00:00    3         17             112         3        3
2018-05-17 02:00:00    4         3              6           1        3
2018-05-17 02:00:00    5         2              1           2        3
2018-05-17 02:00:00    1000      25             256         3        3

Now, getting an aggregated report from this new summary table is still slow, and the query time increases as rows are added.

With only 120,000 rows covering 2 days, this query takes around 3-4 seconds. I assume the time will keep increasing as more rows accumulate over the coming dates.

    SELECT DATE(datetime) AS item,
    SUM(clicks) AS clicks,
    SUM(approved+over_capped+internal) AS total_installs,
    ROUND(SUM(approved_income+over_capped_income+internal_income),2) AS total_income,
    SUM(over_capped) AS over_capped,
    ROUND(SUM(over_capped_income),2) AS over_capped_income,
    SUM(internal) AS internal,
    ROUND(SUM(internal_income),2) AS internal_income,
    ROUND(IFNULL((SUM(approved+over_capped+internal) / SUM(clicks)) * 100, 0),3) AS cr,
    ROUND(IFNULL((SUM(approved_income+over_capped_income+internal_income) / SUM(clicks)) * 100, 0),3) AS epc,
    ROUND(SUM(approved_income+internal_income),2) AS approved_income,
    ROUND(SUM(payout),2) AS payout,
    ROUND(SUM(approved_income+internal_income-payout),2) AS approved_earnings
    FROM stats_main
    WHERE datetime > '2018-05-08' AND datetime < '2018-05-17'
    GROUP BY item
    LIMIT 25;

There is a compound unique index on the stats_main table:

ALTER TABLE `stats_main` ADD UNIQUE INDEX `unique_index`(`datetime`, `offer_id`, `publisher_id`, `advertiser_id`, `country_id`, `os_id`);

My questions are:

1: Am I doing something wrong?

2: Can the structure of the tables be changed to improve query performance? If yes, how?

Best Answer

FLOAT(m,n) should never be used. Perhaps you want DECIMAL(6,2) for exact money? And is 9999.99 big enough? I suspect it is not, at least for the summary tables.

click is awfully bulky for a table getting 40M inserts per day (~460/sec). Can some of those VARCHARs be normalized or eliminated? Perhaps even normalize certain clumps of them.

If you are purging 'old' (2-week-old) data, consider PARTITION BY RANGE(..) with daily partitions, then DROP PARTITION is very fast. (More discussion: http://mysql.rjweb.org/doc.php/partitionmaint )
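A sketch of what that could look like on the click table (partition names and dates are illustrative; note that MySQL partitioning forbids foreign keys, so the FKs on click would have to go, and every unique key must include the partition column, so the PRIMARY KEY would need to become `(id, datetime)`):

```sql
-- Sketch: daily partitions on click so old days can be dropped cheaply.
ALTER TABLE click
    PARTITION BY RANGE (TO_DAYS(datetime)) (
        PARTITION p20180508 VALUES LESS THAN (TO_DAYS('2018-05-09')),
        PARTITION p20180509 VALUES LESS THAN (TO_DAYS('2018-05-10')),
        PARTITION pfuture   VALUES LESS THAN MAXVALUE
    );

-- Purging a day then becomes a near-instant metadata operation
-- instead of a huge DELETE:
ALTER TABLE click DROP PARTITION p20180508;
```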

The key test for whether a summary table summarizes enough is the ratio of rows. One-tenth as many summarized rows as raw rows is a good target.

Please provide SHOW CREATE TABLE for the summary table. Based on your two samples, I would expect something like:

CREATE TABLE Summary (
    -- Keys:
    date DATE NOT NULL,             -- from DATE(datetime)
    publisher_id INTEGER NOT NULL,
    offer_id INTEGER NOT NULL,
    -- Subtotals:
    clicks   MEDIUMINT NOT NULL,   -- from COUNT(*)
    installs SMALLINT NOT NULL,
    income   DECIMAL(8,2) NOT NULL,  -- from SUM(income)
    payout   DECIMAL(8,2) NOT NULL,
    -- indexes:
    PRIMARY KEY(publisher_id, date, offer_id),
    INDEX(date, offer_id)
) ENGINE=InnoDB;

(I recommend picking the appropriate integer sizes everywhere; it saves a little space, thereby improving speed a little.)

A 'bug':

    WHERE  datetime >  '2018-05-08'
      AND  datetime <  '2018-05-17'

Leaves out midnight on the first day.

For a week's worth of data, I recommend this pattern:

    WHERE  datetime >= '2018-05-08'
      AND  datetime <  '2018-05-08' + INTERVAL 7 DAY

(Your first example has DATE(datetime) AS item -- smells wrong.)

The second example becomes

SELECT  offer_id,
        SUM(clicks) AS clicks,
        SUM(installs) AS installs,
        SUM(income) AS income,
        SUM(payout) AS payout
    FROM Summary
    WHERE  publisher_id IN (123,456,78)
      AND  `date` >= '2018-05-08'
      AND  `date` <  '2018-05-08' + INTERVAL 7 DAY
    GROUP BY  offer_id
    ORDER BY something??
    LIMIT  25;

(Note: A LIMIT without an ORDER BY is not meaningful.)

Your sample output shows a breakdown by hour. That is possible.

You will probably need more than one Summary table. The one above handles any combination of publisher_id and/or offer_id, broken down by day.

If you break down the data by hour, you will probably need fewer other parts to the PRIMARY KEY so as to get close to the target of '10'.

Every day (or hour), add the new rows for that day (or hour) to the summary tables. Or you can use IODKU (INSERT ... ON DUPLICATE KEY UPDATE) as you go. Either approach avoids the "120000 and growing" problem you mentioned. (More in the link below.)
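An IODKU roll-up against the Summary table sketched above might look like this (the one-day window is an illustrative assumption, and clicks would be folded in by a similar statement against the click table):

```sql
-- Sketch: fold one day of install rows into Summary via IODKU.
-- Re-running it for the same window is safe: existing rows are
-- updated in place rather than duplicated.
INSERT INTO Summary
    (date, publisher_id, offer_id, clicks, installs, income, payout)
SELECT  DATE(datetime), publisher_id, offer_id,
        0,                      -- clicks come from the click roll-up
        COUNT(*), SUM(income), SUM(payout)
FROM    install
WHERE   datetime >= '2018-05-08'
  AND   datetime <  '2018-05-08' + INTERVAL 1 DAY
GROUP BY 1, 2, 3
ON DUPLICATE KEY UPDATE
    installs = installs + VALUES(installs),
    income   = income   + VALUES(income),
    payout   = payout   + VALUES(payout);
```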

More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables

"save all the columns from click table to install table" -- I don't understand. Perhaps there is another way to do whatever the purpose is?