We have a table for logging actions, and the volume adds up fast: after about 3-4 months we have 600M rows. We need to keep the data for at least 1 year, so the table is expected to hold 2-3 billion rows.
The structure is not very complicated:
CREATE TABLE `site_statistic` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `entityType` varchar(50) NOT NULL,
  `entityId` int(10) unsigned NOT NULL,
  `entityCount` int(10) unsigned NOT NULL DEFAULT '1',
  `timeStamp` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `entity` (`entityType`,`timeStamp`,`entityId`)
) ENGINE=MyISAM CHARSET=utf8;
Sample data:
- when a user logs in: entityType = "user.login", entityId = that user's id, entityCount = 1 most of the time
- when a product is added: entityType = "product.add", entityId = that product's id, entityCount = 1
- when a product is clicked: entityType = "product.clicked", entityId = that product's id, entityCount = 1
- when an album is liked: entityType = "album.liked", entityId = that album's id
- so on and so forth.
Currently, inserting the data is not a problem (yet), but reading it is very slow. It takes 90+ seconds to query product activity for a given period of time:
SELECT
  `entityType`,
  DATE_FORMAT(FROM_UNIXTIME(`timeStamp`), "%e") AS `category`,
  SUM(`entityCount`) AS `count`
FROM
  `site_statistic`
WHERE
  `entityType` IN ('product.displayed','product.clicked','product.add')
  AND `timeStamp` >= 1527267600
  AND `timeStamp` <= 1527872399
  AND `entityId` = 12807
GROUP BY
  `entityType`,
  `category`;
MySQL crashes (or rather stops responding) from time to time when many statistics reports are requested at the same time.
Deleting data (related to one product, for example) is also very slow, and we need to delete to save space…
What can we do to improve this, apart from disabling real-time reports and sending them later by email?
P.S.: Our client can be considered small (a few guys hosting a website with modest profit), so we are not looking for some crazy enterprise solution.
Best Answer
Use InnoDB.
- Normalize entityType or turn it into an ENUM.
- Change INT UNSIGNED (always 4 bytes) into a suitably sized int.
- If the 3-column key is unique, then make it the PRIMARY KEY and toss id.
- id, if kept, will need to be BIGINT UNSIGNED -- you are getting too close to the 4-billion limit of INT UNSIGNED.
- The optimal order for the index: (entityId, entityType, timeStamp), or possibly (entityId, timeStamp).
- Minimize the number of indexes on a Data Warehouse table.
- For purging data after a year, DELETE will be painful. Instead PARTITION BY RANGE and use months: http://mysql.rjweb.org/doc.php/partitionmaint
- If you get in trouble with the ingestion rate: http://mysql.rjweb.org/doc.php/staging_table

A combined sketch of these schema changes appears right after this list.
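This is a sketch only, not the answer's literal DDL: the ENUM values, column sizes, and partition boundaries are my assumptions and need checking against the real data (the PRIMARY KEY also assumes the 3-column combination is unique):

CREATE TABLE `site_statistic` (
  `entityType` ENUM('user.login','product.add','product.clicked','product.displayed','album.liked') NOT NULL,
  `entityId` MEDIUMINT UNSIGNED NOT NULL,             -- 3 bytes; keep INT UNSIGNED if ids can exceed 16M
  `entityCount` SMALLINT UNSIGNED NOT NULL DEFAULT 1, -- 2 bytes; assumes per-event counts stay small
  `timeStamp` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`entityId`, `entityType`, `timeStamp`)
) ENGINE=InnoDB
PARTITION BY RANGE (`timeStamp`) (
  PARTITION p2018_05 VALUES LESS THAN (UNIX_TIMESTAMP('2018-06-01')),
  PARTITION p2018_06 VALUES LESS THAN (UNIX_TIMESTAMP('2018-07-01')),
  PARTITION pFuture VALUES LESS THAN MAXVALUE
);

Purging a month then becomes a near-instant DROP PARTITION instead of a huge DELETE, and because the PRIMARY KEY starts with entityId, the slow report query scans a few contiguous row ranges instead of jumping all over the table.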
If you keep id, then do this trick:
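Presumably this is the usual pattern of putting id last in the PRIMARY KEY and keeping a bare index on it (my reconstruction, not the answer's exact code):

ALTER TABLE `site_statistic`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`entityId`, `entityType`, `timeStamp`, `id`), -- id last keeps the key unique even if the triple is not
  ADD INDEX (`id`);                                              -- a plain index is all AUTO_INCREMENT needs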
This, together with InnoDB, makes the data for that SELECT mostly 'clustered', hence faster to fetch.

But the big fix to that query, and similar "report" queries in a Data Warehouse (which is what you have), comes from building and maintaining Summary table(s): http://mysql.rjweb.org/doc.php/summarytables
The summary table to support that would look something like:
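A plausible shape, assuming daily granularity (the table and column names are illustrative, not from the answer):

CREATE TABLE `site_statistic_daily` (
  `entityType` VARCHAR(50) NOT NULL,   -- or the normalized/ENUM form suggested above
  `entityId` INT UNSIGNED NOT NULL,
  `day` DATE NOT NULL,
  `totalCount` INT UNSIGNED NOT NULL,  -- SUM of entityCount over that day
  PRIMARY KEY (`entityId`, `entityType`, `day`)
) ENGINE=InnoDB;

The 90-second report then reads a handful of pre-aggregated rows (the date literals approximate the unix timestamps in the question):

SELECT `entityType`,
       DAY(`day`) AS `category`,
       SUM(`totalCount`) AS `count`
FROM `site_statistic_daily`
WHERE `entityId` = 12807
  AND `entityType` IN ('product.displayed','product.clicked','product.add')
  AND `day` >= '2018-05-25'
  AND `day` <= '2018-06-01'
GROUP BY `entityType`, `category`;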
See the link for various ways to keep the summary table updated.
Addenda
Munch on what I said. Then think about splitting it into multiple similar tables based on entityType.
One simplification -- count is always "1" for some Types, so you don't need a count column for those (except in the Summary table for that Type).
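For example, a dedicated login table could drop both columns (a hypothetical sketch; assumes a user logs in at most once per second):

CREATE TABLE `user_login` (
  `userId` INT UNSIGNED NOT NULL,
  `timeStamp` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`userId`, `timeStamp`)  -- no entityType or entityCount column needed
) ENGINE=InnoDB;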