MySQL – How to optimize a SQL table of 2 billion rows

MySQL, performance, performance-tuning

We have a table for logging actions. The volume grew quickly: after about 3-4 months we already have 600M rows. We need to keep the data for at least 1 year, so the table is expected to hold 2-3 billion rows.

The structure is not very complicated:

    CREATE TABLE `site_statistic` (
     `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
     `entityType` varchar(50) NOT NULL,
     `entityId` int(10) unsigned NOT NULL,
     `entityCount` int(10) unsigned NOT NULL DEFAULT '1',
     `timeStamp` int(10) unsigned NOT NULL,
     PRIMARY KEY (`id`),
     KEY `entity` (`entityType`,`timeStamp`,`entityId`)
    ) ENGINE=MyISAM CHARSET=utf8

Sample data:

  • when a user logs in:
    entityType = "user.login" , entityId = that user Id , entityCount= 1 most of the time
  • when a product is added:
    entityType = "product.add" , entityId = that productId , entityCount= 1
  • when a product is clicked:
    entityType = "product.clicked" , entityId = that productId , entityCount= 1
  • when an album is liked:
    entityType = "album .liked" , entityId = that album id..
  • so on and so forth.

Currently, inserting data is not a problem (yet), but reading it is very slow. It takes 90+ seconds to query product activity for a given period of time:

   SELECT
            `entityType`,
            DATE_FORMAT(FROM_UNIXTIME(`timeStamp`), "%e") AS `category`,
            SUM(`entityCount`) as `count`
        FROM
            `site_statistic`
        WHERE
            `entityType` IN ('product.displayed','product.clicked','product.add')
                AND
            `timeStamp` >= 1527267600
                AND
            `timeStamp` <= 1527872399 AND entityId = 12807
        GROUP BY
            `entityType`,
            `category`;

MySQL crashes (or rather becomes unresponsive) from time to time when many statistics reports are requested at the same time.

Deleting data (related to a product, for example) is also very slow. We need to delete to save space…

What can we do to improve this, apart from disabling real-time reports and sending them later by email?

P.S.: Our client can be considered small (a few guys hosting a website with modest profit), not some crazy enterprise solution.

Best Answer

  • Use InnoDB.

  • Normalize entityType or turn it into an ENUM.

  • Change INT UNSIGNED (always 4 bytes) into a suitably sized integer type.

  • If the 3-column key is unique, then make it the PRIMARY KEY and toss id

  • id, if kept, will need to be BIGINT UNSIGNED -- you are getting too close to the 4 billion limit of INT UNSIGNED.

  • The optimal order for the index: (entityId, entityType, timeStamp), or possibly (entityId, timeStamp)

  • Minimize the number of indexes on a Data Warehouse table.

  • For purging data after a year, DELETE will be painful. Instead, PARTITION BY RANGE and use months (see the partition clause in the sketch after this list): http://mysql.rjweb.org/doc.php/partitionmaint

  • If you get in trouble with the ingestion rate: http://mysql.rjweb.org/doc.php/staging_table

  • If you keep id, then do this trick:

    PRIMARY KEY (entityId, timeStamp, id)
    INDEX(id)

This, together with InnoDB, makes the data for that SELECT mostly 'clustered', hence faster to fetch.
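Putting several of these points together, a revised definition might look roughly like the sketch below. The ENUM values, the smaller column sizes, and the monthly partition boundaries are assumptions to adapt to your own data (the partitionmaint link above covers adding new partitions on schedule):

    CREATE TABLE `site_statistic` (
     `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
     `entityType` ENUM('user.login','product.add','product.clicked',
                       'product.displayed','album.liked') NOT NULL,
     `entityId` INT UNSIGNED NOT NULL,
     `entityCount` SMALLINT UNSIGNED NOT NULL DEFAULT 1,
     `timeStamp` INT UNSIGNED NOT NULL,
     PRIMARY KEY (`entityId`,`timeStamp`,`id`),  -- clusters each entity's rows together
     KEY (`id`)                                  -- satisfies AUTO_INCREMENT
    ) ENGINE=InnoDB
    PARTITION BY RANGE (`timeStamp`) (
        PARTITION p201805 VALUES LESS THAN (1527811200),  -- rows before 2018-06-01 UTC
        PARTITION p201806 VALUES LESS THAN (1530403200),  -- rows before 2018-07-01 UTC
        PARTITION pMax    VALUES LESS THAN MAXVALUE
    );

Purging a month is then ALTER TABLE site_statistic DROP PARTITION p201805;, which is essentially instant compared with a multi-million-row DELETE.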

But the big fix for that query, and for similar "report" queries in a Data Warehouse (which is what you have), comes from building and maintaining Summary table(s): http://mysql.rjweb.org/doc.php/summarytables

The summary table to support that would look something like:

    CREATE TABLE summary (
        entityType VARCHAR(50) NOT NULL,   -- denormalized
        dy DATE NOT NULL,                  -- timestamp truncated to the day
        entityId INT UNSIGNED NOT NULL,
        sumCount INT UNSIGNED NOT NULL,    -- SUM(entityCount); use SUM(sumCount) to get weekly totals (etc)
        PRIMARY KEY (entityId, entityType, dy)
    ) ENGINE=InnoDB;

See the link for various ways to keep the summary table updated.
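One simple way to populate it (a sketch, assuming a job that runs shortly after midnight and folds the previous day's raw rows into the summary defined above) is an INSERT ... SELECT with ON DUPLICATE KEY UPDATE, so re-running the job for the same complete day just overwrites the same totals:

    INSERT INTO summary (entityType, dy, entityId, sumCount)
        SELECT  entityType,
                DATE(FROM_UNIXTIME(timeStamp)) AS dy,
                entityId,
                SUM(entityCount)
            FROM  site_statistic
            WHERE timeStamp >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
              AND timeStamp <  UNIX_TIMESTAMP(CURDATE())
            GROUP BY entityType, dy, entityId
        ON DUPLICATE KEY UPDATE sumCount = VALUES(sumCount);

The 90-second report then reads the summary instead of the raw table (the date range below is the original timeStamp window expressed as days, assuming the report works in whole days):

    SELECT  entityType,
            DATE_FORMAT(dy, "%e") AS `category`,
            SUM(sumCount) AS `count`
        FROM  summary
        WHERE entityType IN ('product.displayed','product.clicked','product.add')
          AND entityId = 12807
          AND dy BETWEEN '2018-05-26' AND '2018-06-01'
        GROUP BY entityType, `category`;

This touches at most a few dozen summary rows instead of scanning millions of raw rows.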

Addenda

Munch on what I said. Then think about splitting it into multiple similar tables based on entityType.

One simplification -- the count is always "1" for some Types, so you don't need a count column there (except in the Summary table for that Type).
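For example, a per-type table for clicks might look like this sketch (the name product_clicked is hypothetical; entityType and entityCount are dropped, and id is kept only so the primary key stays unique when the same product is clicked twice in the same second):

    CREATE TABLE product_clicked (
        productId INT UNSIGNED NOT NULL,
        timeStamp INT UNSIGNED NOT NULL,
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- only to keep the PK unique
        PRIMARY KEY (productId, timeStamp, id),
        KEY (id)
    ) ENGINE=InnoDB;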