I am trying to set up an event that automatically moves data older than a specified age from a heavily used table to an "archive" table, to improve performance.
So far I've only come across answers that are based on a WHERE clause used for both the INSERT INTO ... SELECT and the DELETE statement. Since time passes between the two statements, the WHERE clause could theoretically match different rows and thus lead to data loss.
This is my event:
DELIMITER $$
CREATE EVENT IF NOT EXISTS archive_6_month_old_entries
ON SCHEDULE EVERY 1 MINUTE
STARTS CURRENT_TIMESTAMP
DO
BEGIN
  START TRANSACTION;

  CREATE TEMPORARY TABLE data_to_move
  SELECT *
  FROM   api_log
  WHERE  (datum - UNIX_TIMESTAMP(DATE_ADD(NOW(), INTERVAL -6 MONTH))) < 0
  LIMIT  5000;

  INSERT INTO api_log_archive
  SELECT * FROM data_to_move;

  DELETE
  FROM   api_log
  WHERE  id IN (SELECT id FROM data_to_move);

  COMMIT;
END $$
DELIMITER ;
Is there any way to do this without a temporary table?
Best Answer
You say:
"the WHERE clause could theoretically have different results and thus lead to data loss"
No, you don't have to risk losing data! That's what TRANSACTIONs are for! InnoDB uses Multi-Version Concurrency Control (MVCC). You should read up on this important topic in order to better exploit the capabilities of your server. Something like this will work:
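A minimal sketch, assuming the api_log / api_log_archive tables and the Unix-timestamp datum column from the question; the cut-off is captured once, in a user variable, so both statements are guaranteed to use exactly the same value:

```sql
-- Run the archive step as a single SERIALIZABLE transaction so that the
-- INSERT ... SELECT and the DELETE operate on the same set of rows.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

START TRANSACTION;

INSERT INTO api_log_archive
SELECT *
FROM   api_log
WHERE  datum < @cutoff;

DELETE
FROM   api_log
WHERE  datum < @cutoff;

COMMIT;
```

Note the identical WHERE clause in both statements - under SERIALIZABLE, both see the same consistent view of api_log.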
Put this into an EVENT and schedule it to run every day at (say) 03:30 every night. Once you are happy that your event is running regularly at the time you require, all you'll have to do is make monitoring it part of your daily monitoring routine.
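The scheduling part might look something like this (a sketch - the event name is taken from the question and the 03:30 start time is just the value suggested above):

```sql
DELIMITER $$

CREATE EVENT IF NOT EXISTS archive_6_month_old_entries
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL '3:30' HOUR_MINUTE
DO
BEGIN
  -- Capture the cut-off once so both statements use the same value.
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

  START TRANSACTION;

  INSERT INTO api_log_archive
  SELECT * FROM api_log WHERE datum < @cutoff;

  DELETE FROM api_log WHERE datum < @cutoff;

  COMMIT;
END $$

DELIMITER ;
```

Remember that the event scheduler must be enabled (event_scheduler = ON) for this to fire.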
Isolation levels
A brief explanation of the difference between the READ COMMITTED and the SERIALIZABLE transaction isolation levels follows.
The default TRANSACTION ISOLATION LEVEL in PostgreSQL is READ COMMITTED (the MariaDB default, by contrast, is REPEATABLE READ). So, in terminal 1, I check the current setting - the result is read committed, which is what we want - and set up a small test table t. Now we delve into transactions as follows:
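The terminal 1 steps might look like this (the one-column test table t is an assumption made for illustration):

```sql
-- Terminal 1 (PostgreSQL, default READ COMMITTED):
SHOW TRANSACTION ISOLATION LEVEL;   -- read committed

CREATE TABLE t (a int);
INSERT INTO t VALUES (1);

BEGIN;              -- open an explicit transaction...
SELECT * FROM t;    -- ...and read the table: one row (a = 1)
-- leave the transaction open and switch to terminal 2
```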
Here, we "suspend" the transaction by switching to a second terminal, terminal 2...
We run this query:
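The monitoring query (run from terminal 2) might be along these lines - pg_stat_activity shows sessions whose transaction is still open:

```sql
-- Sessions with an open transaction that are not currently running a query:
SELECT pid,
       now() - xact_start AS xact_age,   -- how long the transaction has been open
       state,
       query                             -- the last statement the session ran
FROM   pg_stat_activity
WHERE  state = 'idle in transaction';
```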
The result (active sessions other than the monitoring query itself) shows that the transaction holding my SELECT * FROM t; is still open and is over 5 minutes old. So, now, again in terminal 2, we insert a row into the table:
And (still terminal 2), I check:
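Something like this (continuing with the assumed one-column table t):

```sql
-- Terminal 2 (autocommit on): add a row and confirm it has been committed.
INSERT INTO t VALUES (2);
SELECT * FROM t;    -- terminal 2 sees both rows
```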
And I go back to terminal 1 and issue a SELECT * FROM t; - the result now includes the row just committed by terminal 2, because under READ COMMITTED each new statement sees the latest committed data, even inside an open transaction.
Now, to see the behaviour with the TRANSACTION ISOLATION LEVEL set to SERIALIZABLE, we issue the appropriate command.
I'm using a local PostgreSQL instance as that's the only thing I have running and running these sorts of commands isn't possible using a fiddle - the equivalents for MariaDB are here - I leave it up to you to construct the required tests with MariaDB.
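In PostgreSQL, starting a SERIALIZABLE transaction in terminal 1 is a one-liner:

```sql
-- Terminal 1: open a transaction at the stricter isolation level.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```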
So, from terminal 1, we check our settings - the result is now serializable - so we're good to go. From terminal_1_> (SERIALIZABLE) we run our first query:
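That first query is simply:

```sql
-- Terminal 1, inside the SERIALIZABLE transaction: this first read
-- establishes the snapshot that the whole transaction will see.
SELECT * FROM t;
```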
Now, from terminal_2_> we run our check on active queries again:
The result: the transaction in terminal_1_> is sitting there and is now > 3 mins old...
From terminal_2_> again, we add a new record to the table and check:
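For example:

```sql
-- Terminal 2 (autocommit on): add another row and confirm it is committed.
INSERT INTO t VALUES (3);
SELECT * FROM t;    -- terminal 2 sees the new row immediately
```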
Now we go back to terminal_1_> and run SELECT * FROM t; a second time within the same serializable transaction - and we get exactly the same rows as before. The SERIALIZABLE transaction has the same view of table t that it had at the beginning of the transaction, despite terminal_2_> having INSERTed a new record...
I would urge you to experiment with differing setups to get to know your server's capabilities and how you can best exploit them...
In terms of your problem, what this means is that:
Scenario 1 (you keep the defaults on your server - i.e. REPEATABLE READ):
At time t1, you run the SELECT into your temporary table.
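That step is the one from your event - paraphrased here with the WHERE clause rewritten in an equivalent, simpler form:

```sql
-- Step 1 of 3: snapshot the candidate rows into a temporary table.
CREATE TEMPORARY TABLE data_to_move
SELECT *
FROM   api_log
WHERE  datum < UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH)
LIMIT  5000;
```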
Say, for example, you SELECT 10 records. It's a log table, so it's constantly being updated... At time t2, you INSERT into your archive table. At time t3, you DELETE from your api_log.
Now, your process will miss any records that have aged to over 6 months between t1 and t3 - so, fine, you say, I'll pick them up the next time...
Scenario 2 (you run your archiving process with the TRANSACTION ISOLATION LEVEL set to SERIALIZABLE):
For starters, you can cut out the CREATE TEMPORARY TABLE phase entirely. The advantage of this is that you don't have the overhead of creating a potentially large TEMPORARY TABLE, you reduce the number of steps from 3 to 2, and your system will be more robust.
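A sketch of the two remaining steps, assuming the question's tables; the cut-off is captured once in a user variable, so the two WHERE clauses are guaranteed to be identical, and under SERIALIZABLE both statements see the same rows:

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

START TRANSACTION;

-- Step 1 of 2: copy the aged rows to the archive...
INSERT INTO api_log_archive
SELECT * FROM api_log WHERE datum < @cutoff;

-- Step 2 of 2: ...and delete exactly the same rows from the source.
DELETE FROM api_log WHERE datum < @cutoff;

COMMIT;
```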
You could also consider locking the api_log table for the duration of your transaction, but that's generally considered not to be good practice if it can be avoided. If you wish to explore this possibility, then you might like to ask another question...
This fiddle should help clarify things - PostgreSQL has the useful (and intelligently explicit) functions TRANSACTION_TIMESTAMP and CLOCK_TIMESTAMP: with READ COMMITTED any query sees the data as of CLOCK_TIMESTAMP, while a SERIALIZABLE transaction sees the data as of TRANSACTION_TIMESTAMP!