MySQL Partitioning – Event Tables with Start and End Datetime Columns


I am curious what solutions exist for partitioning tables that log events, such that queries stay efficient and old data can still be purged cheaply by dropping partitions.

Suppose I have a simple table that records events from different locations:

tblEvents (
  event_id,
  location_id,
  start_datetime,
  end_datetime
)

Most queries on this table will take the form of:

SELECT    event_id
FROM      tblEvents
WHERE     location_id = @queried_location_id
AND       start_datetime < @queried_end_datetime
AND       end_datetime > @queried_start_datetime

To date, I have partitioned by location_id and simply indexed the datetime columns. Performance-wise, this has sufficed, and since I never planned to keep more than a couple of months' worth of data in the database, it seemed future-proof as well.

The problem arose when I actually went to purge data from this table (it is very large and also involved in replication). Using DELETE FROM tblEvents WHERE start_datetime < @some_date proved to be very slow, and created problems with requests being made from other clients (not surprisingly).

Partitioning by either of the datetime columns is no good. Although it solves my data purging problem, it makes the query above inefficient, as it still needs to look at multiple partitions depending on the dates being queried.

Am I missing a common solution here? Is there a more efficient way to purge my data? Or is there a smarter way to partition/index that I have missed?

Best Answer

Do the time ranges overlap for a given location?

If you need to delete all rows where start_datetime < @some_date, then the best way is to partition on start_datetime. Use PARTITION BY RANGE (TO_DAYS(start_datetime)) with daily, weekly, or monthly ranges, chosen so that you end up with about 20-50 partitions. Purging then becomes a matter of dropping the oldest partition instead of running a DELETE.
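As a rough illustration of that layout and of purging by partition, here is a minimal sketch. The column types, the partition names, and the monthly granularity are assumptions made for the example, and only a few partitions are shown for brevity.

```sql
-- Sketch only: column types and partition names are assumptions.
-- Keys are omitted here; MySQL requires every PRIMARY/UNIQUE key on a
-- partitioned table to include the partitioning column.
CREATE TABLE tblEvents (
  event_id       BIGINT UNSIGNED NOT NULL,
  location_id    INT UNSIGNED    NOT NULL,
  start_datetime DATETIME        NOT NULL,
  end_datetime   DATETIME        NOT NULL
)
PARTITION BY RANGE (TO_DAYS(start_datetime)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- Purge January by dropping the whole partition
-- instead of deleting its rows one by one.
ALTER TABLE tblEvents DROP PARTITION p2024_01;

-- Make room for the next month by splitting the catch-all partition.
ALTER TABLE tblEvents REORGANIZE PARTITION pfuture INTO (
  PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);
```

The catch-all pfuture partition is a design choice in this sketch: it keeps inserts from failing if the next month's partition has not been created yet.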

Plan A:

Once you partition that way, make the PRIMARY KEY (location_id, end_datetime, start_datetime), in that order. If those 3 columns are not sufficient to be unique and you have, say, an id AUTO_INCREMENT column, then do this:

PRIMARY KEY(location_id, end_datetime, start_datetime, id),
INDEX(id)

Now, let's analyze the query:

WHERE     location_id = @queried_location_id
AND       start_datetime < @queried_end_datetime
AND       end_datetime > @queried_start_datetime

The following steps happen:

1. "Partition pruning" based on start_datetime. This may or may not help much, depending on how recent the queried range is. On average(?), it will eliminate half the partitions (not much benefit).
2. Using the PK, it will home in on the appropriate location_id and end_datetime.
3. It scans forward through the data (since the PK is 'clustered' with the data), checking the remaining rows for that location_id for a match.
4. Steps 2 and 3 are repeated for each partition not pruned out by step 1.
5. The results from the partitions are combined.
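To see which partitions survive pruning for a query like this, EXPLAIN can be run against a table laid out as in Plan A. The following is a sketch under assumptions: the table name tblEvents_planA, the column types, the partition names, and the literal values are all made up for illustration.

```sql
-- Sketch only: table name, column types, and partition names are assumptions.
CREATE TABLE tblEvents_planA (
  id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  location_id    INT UNSIGNED    NOT NULL,
  start_datetime DATETIME        NOT NULL,
  end_datetime   DATETIME        NOT NULL,
  -- Plan A key order: location first, then end, then start.
  PRIMARY KEY (location_id, end_datetime, start_datetime, id),
  -- AUTO_INCREMENT needs an index that starts with id.
  INDEX (id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(start_datetime)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- The "partitions" column of the output lists the partitions
-- that remain after pruning on start_datetime.
EXPLAIN
SELECT id
FROM   tblEvents_planA
WHERE  location_id    = 42
AND    start_datetime < '2024-02-15 00:00:00'
AND    end_datetime   > '2024-02-10 00:00:00';
```

On MySQL 5.6 and earlier the partitions column only appears with EXPLAIN PARTITIONS; from 5.7 on, plain EXPLAIN includes it.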

Plan B:

If your start..end ranges are predominantly recent dates, and if you can purge based on end_datetime, then this would probably be better, because queries would mostly touch the last partition:

PARTITION BY RANGE(TO_DAYS(end_datetime))

PRIMARY KEY(location_id, start_datetime, end_datetime, id),
INDEX(id)

If the ranges for a location never overlap, exploiting that fact may give significantly better performance.

But if end_datetime is initially NULL, the row will be put into the "first" partition. So be sure to have an extra partition at the beginning for such rows; otherwise the DROP PARTITION used to purge old records will discard them by mistake. That is, make the first range VALUES LESS THAN (0), make the second range the oldest month (or week, or whatever), and drop that second partition when purging.

Another issue with NULL -- when you do set end_datetime, the record will have to be moved from one partition to another, thereby making the UPDATE statement somewhat costly. (It is effectively a DELETE plus an INSERT.) For this reason, I don't recommend Plan B for very high activity.
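A possible layout for Plan B, including the guard partition, might look like the sketch below. The table name tblEvents_planB, the column types, and the partition names are assumptions. One caveat to keep in mind: MySQL requires PRIMARY KEY columns to be NOT NULL, so with this key the guard partition only receives rows whose TO_DAYS(end_datetime) evaluates to NULL (for example a zero date, where the SQL mode allows one), not rows with a literal NULL end_datetime.

```sql
-- Sketch only: table name, column types, and partition names are assumptions.
CREATE TABLE tblEvents_planB (
  id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  location_id    INT UNSIGNED    NOT NULL,
  start_datetime DATETIME        NOT NULL,
  -- Part of the PRIMARY KEY, so it cannot be declared NULL.
  end_datetime   DATETIME        NOT NULL,
  PRIMARY KEY (location_id, start_datetime, end_datetime, id),
  INDEX (id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(end_datetime)) (
  -- Guard partition: holds rows whose partitioning expression is NULL.
  -- Never drop this one when purging.
  PARTITION pguard   VALUES LESS THAN (0),
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- Purge the oldest real range; pguard stays in place.
ALTER TABLE tblEvents_planB DROP PARTITION p2024_01;
```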

Related Solutions

MySQL, “columns partitioning” when multiple columns are individually important

You are right that MySQL will check only one partition for a specific sender_id but will check all the partitions for a specific receiver_id, as shown here:

mysql> explain partitions select * from messages where sender_id =5;
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | messages | p0         | ALL  | NULL          | NULL | NULL    | NULL |    2 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain partitions select * from messages where receiver_id =5;
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | messages | p0,p1,p3   | ALL  | NULL          | NULL | NULL    | NULL |    3 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

However, there are still benefits to that partitioning, depending on your hardware. When looking in all partitions for a receiver_id, MySQL is really performing 3 SELECT statements, one for each partition. It may be able to parallelize these statements. Additionally, if you index receiver_id, it will be accessing 3 smaller indexes.
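The table definition behind the EXPLAIN output above is not shown, so the following is only a guess at a comparable setup: a hypothetical messages table hash-partitioned on sender_id, with a secondary index on receiver_id. All column names other than sender_id and receiver_id, the types, and the partition count are assumptions.

```sql
-- Hypothetical schema: the original table definition is not shown.
CREATE TABLE messages (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  sender_id   INT UNSIGNED    NOT NULL,
  receiver_id INT UNSIGNED    NOT NULL,
  body        TEXT,
  -- The partitioning column must be part of every unique key.
  PRIMARY KEY (id, sender_id),
  -- Stored as one smaller index per partition.
  INDEX (receiver_id)
) ENGINE=InnoDB
PARTITION BY HASH (sender_id) PARTITIONS 4;

-- Pruned to a single partition: the predicate is on the partitioning column.
EXPLAIN SELECT * FROM messages WHERE sender_id = 5;

-- Not pruned: every partition is probed, each through its own
-- receiver_id index.
EXPLAIN SELECT * FROM messages WHERE receiver_id = 5;
```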

In the end, you just have to do performance testing and see if it is paying off for your use case. Seeing as 100MB fits in RAM pretty easily these days, I wouldn't consider partitioning such a small table unless you have specific reasons to do so.

SQL Server – Is Table Partitioning Improving Performance?

Using partitioning is only going to help your query performance if the partitioning scheme is built to serve your specific queries.

You're going to have to review your query patterns and see how they are accessing the table in order to identify the best approach. The reason for this is you can only partition on a single column (the partitioning key) and this is what will be used for partition elimination.

There are two factors that affect if partition elimination can occur and how well it will perform:

Partition Key - Partitioning can only occur on a single column and your query must include that column. For example, if your table is partitioned on date and your query uses that date column, then partition elimination should occur. However, if you don’t include the partition key within the query predicate, the engine cannot perform elimination.
Granularity - If your partitions are too big, you won’t gain any benefit from elimination because it will still pull back more data than it needs to. However, make them too small and they become difficult to manage.
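Since this answer concerns SQL Server, the two factors can be sketched in T-SQL. Everything here is hypothetical: the object names (pfOrderDate, psOrderDate, dbo.Orders), the columns, and the monthly boundaries are chosen only to illustrate a predicate that allows partition elimination and one that does not.

```sql
-- T-SQL sketch; all object names and boundaries are hypothetical.
CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

-- Map every partition to the same filegroup to keep the example simple.
CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders (
  order_id   bigint NOT NULL,
  order_date date   NOT NULL,
  amount     money  NOT NULL,
  CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (order_date, order_id)
) ON psOrderDate (order_date);

-- Predicate on the partitioning key: eligible for partition elimination.
SELECT order_id
FROM   dbo.Orders
WHERE  order_date >= '2024-02-01' AND order_date < '2024-03-01';

-- No predicate on the partitioning key: every partition must be considered.
SELECT order_id
FROM   dbo.Orders
WHERE  amount > 1000;
```

The monthly boundaries stand in for the granularity decision; coarser or finer boundaries would change how much data each eliminated partition saves.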

In many ways, partitioning is just like using any other index, with some added benefits. However, you don’t realize those benefits unless you're dealing with incredibly large tables. Personally, I don't even consider partitioning until my table is over 250 GB in size. Most of the time, well-defined indexing will cover many of the use cases on tables smaller than that. Based on your description, you're not seeing huge data growth, so it could be that a properly indexed table will perform just fine for you.

I would strongly recommend that you review whether partitioning is actually necessary to solve your issues. One would usually partition a very large table for the purpose of:

Distributing data between different types of disk so that more “active” data can be placed on faster, more expensive storage while less active data is placed on cheaper, slower storage. This is mostly a cost savings measure.
Assisting in index maintenance for extremely large tables. Since you can rebuild partitions individually, this can assist in keeping indexes properly maintained with minimal impact.
Leveraging partitioning for an improved archival process. See sliding windows.
