Mysql – Is it faster to split a large table into 12 rolling monthly tables & use UNION them for reports or keep large table & delete rows older than 1 year

database-designMySQLunion

My co-worker wants to split a large 158M row stats table into stats_jan, stats_feb, … and use UNION to select from them for reports. Is that standard practice and is it faster than to just use the large table in place and delete rows older than one year? The table is many small rows.

mysql> describe stats;
+----------------+---------------------+------+-----+---------+----------------+
| Field          | Type                | Null | Key | Default | Extra          |
+----------------+---------------------+------+-----+---------+----------------+
| id             | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| badge_id       | bigint(20) unsigned | NO   | MUL | NULL    |                |
| hit_date       | datetime            | YES  | MUL | NULL    |                |
| hit_type       | tinyint(4)          | YES  |     | NULL    |                |
| source_id      | bigint(20) unsigned | YES  | MUL | NULL    |                |
| fingerprint_id | bigint(20) unsigned | YES  |     | NULL    |                |
+----------------+---------------------+------+-----+---------+----------------+

I did manually split the table up and copy the rows into the appropriate month tables and created a giant UNION query. The large UNION query took 14s versus 4.5m for the single table query. Why would many smaller tables take a significantly shorter time than one large table, when it's the same number of rows total?

create table stats_jan (...);
create table stats_feb (...);
...
create index stats_jan_hit_date_idx on stats_jan (hit_date);
...
insert into stats_jan select * from stats where hit_date >= '2019-01-01' and hit_date < '2019-02-01';
...
delete from stats where hit_date < '2018-09-01';
...

The monthly tables have from 1.7M rows to 35M rows.

select host as `key`, count(*) as value from stats join sources on source_id = sources.id where hit_date >= '2019-08-21 19:43:19' and sources.host != 'NONE' group by source_id order by value desc limit 10;
4 min 30.39 sec

flush tables;
reset query cache;

select host as `key`, count(*) as value from stats_jan join sources on source_id = sources.id where hit_date >= '2019-08-21 19:43:19' and sources.host != 'NONE' group by source_id
UNION
...
order by value desc limit 10;
14.16 sec

Best Answer

Do not split the table. Use Range Partitionig instead. Study MySQL 8.0 Reference Manual / Partitioning. Use MySQL 8.0 Reference Manual / ... / ALTER TABLE Partition Operations. Keep in mind that it is best to create partitions for future periods in advance (and do not forget to create LESS THAN MAXVALUE partition). Creating new partitions and moving existing data to them at the same time can be more expensive.

Do not delete data permanently. Move it into separate archive table. If you do not have enough disk space - make a backup of such an archive table, check its validity, and only if successful then delete the table. If necessary (it will - be sure!), you can recover and use this data.