MySQL Partitioning – Event Tables with Start and End Datetime Columns

indexMySQLpartitioning

I am curious as to what solutions are out there for partitioning tables that log events, that allow both efficient querying and the advantages of pruning when truncating data.

Suppose I have a simple table that records events from different locations:

tblEvents (
  event_id,
  location_id,
  start_datetime,
  end_datetime
)

Most queries on this table will take the form of:

SELECT    event_id
FROM      tblEvents
WHERE     location_id = @queried_location_id
AND       start_datetime < @queried_end_datetime
AND       end_datetime > @queried_start_datetime

To date, I have partitioned by the location_id, and simply just indexed the datetime columns. Performance-wise, this has sufficed, and I never planned to keep more than a couple of months worth of data in the database, so it seemed future-proof as well.

The problem arose when I actually went to purge data from this table (it is very large and also involved in replication). Using DELETE FROM tblEvents WHERE start_datetime < @some_date proved to be very slow, and created problems with requests being made from other clients (not surprisingly).

Partitioning by either of the datetime columns is no good. Although it solves my data purging problem, it makes the query above inefficient, as it still needs to look at multiple partitions depending on the dates being queried.

Am I missing a common solution here? Is there a more efficient way to purge my data? Or is there a smarter way to partition/index that I have missed?

Best Answer

Do the time ranges overlap for a given location?

If you need to delete all rows where start_datetime < @some_date, then the best way is via PARTITIONing based on start_datetime. Use PARTITION BY RANGE(TO_DAYS(start_datetime)) and base it on days or weeks or months, such that you have about 20-50 partitions. See details.

Plan A:

Once you partition that way, then have the PRIMARY KEY be (location_id, end_datetime, start_datetime), in this order. If those 3 columns are not sufficient to be unique, and, say, you have id AUTO_INCREMENT, then do this:

PRIMARY KEY(location_id, end_datetime, start_datetime, id),
INDEX(id)

Now, let's analyze

WHERE     location_id = @queried_location_id
AND       start_datetime < @queried_end_datetime
AND       end_datetime > @queried_start_datetime

The following steps happen:

  1. "partition pruning" based on start_datetime. This may or may not help much, depending on how new the 'start' is. On average(?), it will eliminate half the partitions (not much benefit).
  2. Using the PK, it will home in on the appropriate location_id and the end_datetime.
  3. Scan forward in the data (since the PK is 'clustered' with the data), checking the rest of those location_ids for a match.
  4. Repeat for each partition not pruned out by step 1
  5. Combine the results from the partition.

Plan B:

If your start..end ranges are predominately recent dates and if you can purge based on end_datetime, then this would probably be better because it would predominately focus on the last partition:

PARTITION BY RANGE(TO_DAYS(end_datetime))

PRIMARY KEY(location_id, start_datetime, end_datetime, id),
INDEX(id)

If the ranges are not overlapping, then folding in this may give significantly more performance.

But... If end_datetime is NULL initially, this will throw the row in to the "first" partition. So... Be sure to have an extra partition at the beginning for that -- else the DROP for purging old records will toss them by mistake. That is, have the first range be VALUES LESS THAN (0). Then have the second range be the oldest month (or week or whatever), and drop that for purging.

Another issue with NULL -- when you do set end_datetime, the record will have to be moved from one partition to another, thereby making the UPDATE statement somewhat costly. (It is effectively a DELETE plus an INSERT.) For this reason, I don't recommend Plan B for very high activity.