I am curious what solutions are out there for partitioning tables that log events: ones that allow both efficient querying and cheap purging of old data by dropping whole partitions.
Suppose I have a simple table that records events from different locations:
tblEvents (
event_id,
location_id,
start_datetime,
end_datetime
)
Most queries on this table will take the form of:
SELECT event_id
FROM tblEvents
WHERE location_id = @queried_location_id
AND start_datetime < @queried_end_datetime
AND end_datetime > @queried_start_datetime
To date, I have partitioned by location_id and simply indexed the datetime columns. Performance-wise this has sufficed, and since I never planned to keep more than a couple of months' worth of data in the database, it seemed future-proof as well.
The problem arose when I actually went to purge data from this table (it is very large and also involved in replication). Using
DELETE FROM tblEvents WHERE start_datetime < @some_date
proved to be very slow, and (not surprisingly) it disrupted requests from other clients.
Partitioning by either of the datetime columns is no good. Although it solves my data purging problem, it makes the query above inefficient, as it still needs to look at multiple partitions depending on the dates being queried.
Am I missing a common solution here? Is there a more efficient way to purge my data? Or is there a smarter way to partition/index that I have missed?
Best Answer
Do the time ranges overlap for a given location?
If you need to delete all rows where start_datetime < @some_date, then the best way is via PARTITIONing based on start_datetime. Use PARTITION BY RANGE(TO_DAYS(start_datetime)) and base it on days or weeks or months, such that you have about 20-50 partitions. See details.
Plan A:
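A sketch of what this could look like, assuming MySQL/InnoDB with monthly partitions; the column types, partition names, and date boundaries are illustrative, not prescribed by the answer:

```sql
-- Plan A sketch: range-partition on start_datetime, one partition per month
CREATE TABLE tblEvents (
    event_id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    location_id    INT UNSIGNED NOT NULL,
    start_datetime DATETIME NOT NULL,
    end_datetime   DATETIME NOT NULL,
    PRIMARY KEY (location_id, end_datetime, start_datetime, event_id),
    INDEX (event_id)          -- keeps AUTO_INCREMENT satisfied
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(start_datetime)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION p202403 VALUES LESS THAN (TO_DAYS('2024-04-01')),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);
```

Note that MySQL requires the partitioning column to appear in every unique key, which the PRIMARY KEY above satisfies because it contains start_datetime.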
Once you partition that way, then have the PRIMARY KEY be (location_id, end_datetime, start_datetime), in this order. If those 3 columns are not sufficient to be unique and, say, you have id AUTO_INCREMENT, then tack id onto the end of the PRIMARY KEY and add INDEX(id) to keep AUTO_INCREMENT satisfied.
Now, let's analyze the SELECT. The following steps happen:
1. Partition pruning limits the work to the partitions that could contain the queried start_datetime. This may or may not help much, depending on how new the 'start' is. On average(?), it will eliminate half the partitions (not much benefit).
2. Within each remaining partition, the index drills down by location_id and the end_datetime.
Plan B:
If your start..end ranges are predominantly recent dates and if you can purge based on end_datetime, then this would probably be better, because activity would predominantly focus on the last partition. If the ranges are not overlapping, then folding this in may give significantly more performance.
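A sketch of the Plan B partitioning, assuming the same table (partition names and date boundaries are illustrative). The empty VALUES LESS THAN (0) partition at the front exists because TO_DAYS(NULL) is NULL, and MySQL places rows whose RANGE partitioning expression is NULL into the lowest partition:

```sql
-- Plan B sketch: range-partition on end_datetime instead
ALTER TABLE tblEvents
PARTITION BY RANGE (TO_DAYS(end_datetime)) (
    PARTITION pNull   VALUES LESS THAN (0),    -- catches rows where end_datetime IS NULL
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);
```

As with Plan A, end_datetime must then appear in every unique key of the table.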
But... If end_datetime is NULL initially, this will throw the row into the "first" partition. So... Be sure to have an extra partition at the beginning for that -- else the DROP for purging old records will toss them by mistake. That is, have the first range be VALUES LESS THAN (0). Then have the second range be the oldest month (or week or whatever), and drop that for purging.
Another issue with NULL -- when you do set end_datetime, the record will have to be moved from one partition to another, thereby making the UPDATE statement somewhat costly. (It is effectively a DELETE plus an INSERT.) For this reason, I don't recommend Plan B for very high activity.
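With either plan, purging becomes a quick metadata operation instead of a massive DELETE. A sketch of the periodic rotation (partition names are illustrative; this would typically run from a scheduled job):

```sql
-- Purge: drop the oldest data partition -- near-instant, no row-by-row
-- DELETE, and it replicates as a single statement
ALTER TABLE tblEvents DROP PARTITION p202401;

-- Make room at the leading edge by splitting the MAXVALUE catch-all
ALTER TABLE tblEvents REORGANIZE PARTITION pFuture INTO (
    PARTITION p202404 VALUES LESS THAN (TO_DAYS('2024-05-01')),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);
```

REORGANIZE PARTITION only rewrites the rows in the partitions it touches, so splitting the catch-all stays cheap as long as no rows have reached pFuture yet.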