MariaDB – Automatically Move Old Data to Another Table Without Temporary Table

isolation-levelmariadbmysql-eventtransaction

I am trying to set up an event that automatically moves data older than a specified age from a heavily used table to an "archive" table, to improve performance.

So far I've only come across answers that are based on a WHERE clause shared by both the INSERT INTO ... SELECT and the DELETE statement. Since time passes between the two statements, the WHERE clause could theoretically match different rows and thus lead to data loss.

This is my event:

DELIMITER $$
CREATE EVENT IF NOT EXISTS archive_6_month_old_entries
ON SCHEDULE EVERY 1 MINUTE
STARTS CURRENT_TIMESTAMP
DO
begin
START TRANSACTION;

CREATE TEMPORARY TABLE data_to_move 
SELECT *
FROM api_log
WHERE (datum - UNIX_TIMESTAMP(DATE_ADD(NOW(), INTERVAL -6 MONTH))) < 0 LIMIT 5000;

INSERT INTO api_log_archive
SELECT * FROM data_to_move;

DELETE
FROM api_log
WHERE id in (SELECT id FROM data_to_move);

COMMIT;
end $$
DELIMITER ;

Is there any way to do this without a temporary table?

Best Answer

You say:

but we have the issue that, if in theory, the insert query takes up to 5 minutes, there will be a difference in the result between the INSERT INTO and the DELETE statement, leading to 5 minutes of lost data.

No, you don't have to risk losing data! That's what TRANSACTIONs are for!

InnoDB uses Multi-Version Concurrency Control (MVCC). You should read up on this important topic in order to better exploit the capabilities of your server.

Something like this will work:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  START TRANSACTION;

  -- evaluate the cut-off ONCE, so the INSERT and the DELETE use the same value
  -- (NOW() is evaluated per statement, so using it in both WHERE clauses could diverge)
  SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

  INSERT INTO api_log_archive
    SELECT *
    FROM api_log
    WHERE datum < @cutoff; -- removed LIMIT 5000, it's not necessary

  DELETE FROM api_log
    WHERE datum < @cutoff; -- removed LIMIT

  COMMIT;

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- or whatever your default is (MariaDB's default is REPEATABLE READ)

Put this into an EVENT and schedule it to run every day at (say) 03:30 every night.

Once you are happy that your event is working regularly at the time you require, then all you'll have to do is make monitoring it part of your daily monitoring routine.
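As a concrete sketch (assuming the table and column names from the question - api_log, api_log_archive, and a datum column holding a unix timestamp - this is illustrative, not tested against your schema), the whole thing wrapped in a nightly MariaDB event might look like this:

```sql
-- Sketch only: assumes api_log.datum holds a unix timestamp and that
-- api_log_archive has the same structure as api_log.
DELIMITER $$

CREATE EVENT IF NOT EXISTS archive_6_month_old_entries
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL '27:30' HOUR_MINUTE   -- i.e. 03:30 tomorrow morning
DO
BEGIN
  -- The event runs in its own session, so there is no default to restore afterwards.
  SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

  START TRANSACTION;

  -- Evaluate the cut-off once, so the INSERT and the DELETE use the same value.
  SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

  INSERT INTO api_log_archive
    SELECT * FROM api_log WHERE datum < @cutoff;

  DELETE FROM api_log WHERE datum < @cutoff;

  COMMIT;
END $$

DELIMITER ;
```

Remember to check that the event scheduler is enabled (SHOW VARIABLES LIKE 'event_scheduler';), otherwise the event will never fire.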

Isolation levels

A brief explanation of the difference between the READ COMMITTED and the SERIALIZABLE transaction isolation levels follows:

The default TRANSACTION ISOLATION LEVEL in PostgreSQL (which I'm using for the demonstration below - see the note further down) is READ COMMITTED. Note that MariaDB's default is REPEATABLE READ, but the contrast with SERIALIZABLE shown here is the important point.

So, in terminal one, I do the following

Terminal_1_> SHOW TRANSACTION ISOLATION LEVEL;

Result:

 transaction_isolation
-----------------------
 read committed
(1 row)

That's what we want - so now we issue this command:

Terminal_1_> SELECT * FROM t;

Result:

 x |  y
---+-----
 1 | One
 2 | Two
(2 rows)

That's as we want it - so, now we delve into transactions as follows:

Terminal_1_> BEGIN TRANSACTION;
BEGIN
Terminal_1_> SELECT * FROM t;
 x |  y
---+-----
 1 | One
 2 | Two
(2 rows)

Here, we "suspend" the transaction by switching to a second terminal (terminal 2)...

We run this query:

SELECT 
       pid, 
       age(clock_timestamp(), query_start), 
       usename, query 
FROM pg_stat_activity 
WHERE datname = 'test' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;

Result (active queries which are not the one above):

 pid  |       age       | usename  |      query
------+-----------------+----------+------------------
 3500 | 00:05:46.113281 | postgres | SELECT * FROM t;
(1 row)

So, we can see that my SELECT * FROM t; query is still active and is over 5 minutes old.

So, now, again in terminal 2, we insert a row into the table:

INSERT INTO t VALUES (3, 'Three');

And (still terminal 2), I check:

Terminal_2_> SELECT * FROM t;
 x |   y
---+-------
 1 | One
 2 | Two
 3 | Three
(3 rows)

And I go back to terminal 1 and, still inside the open transaction, issue a SELECT * FROM t; and the result is:

 x |   y
---+-------
 1 | One
 2 | Two
 3 | Three
(3 rows)

So, under READ COMMITTED, each statement sees the latest committed data, even in the middle of a transaction. Now, to see the behaviour with the TRANSACTION ISOLATION LEVEL set to SERIALIZABLE.

We issue the command:

SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;

I'm using a local PostgreSQL instance, as that's the only thing I have running, and running these sorts of commands isn't possible using a fiddle. The equivalents for MariaDB are here - I leave it up to you to construct the required tests with MariaDB.
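For reference - and as a sketch on my part rather than something demonstrated above - the rough MariaDB equivalents of those PostgreSQL commands are:

```sql
-- Check the current isolation level (MariaDB):
SELECT @@SESSION.tx_isolation;   -- 'REPEATABLE-READ' by default

-- Change it for the current session:
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Or for the next transaction only:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```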

So, from terminal 1, we check our settings:

SELECT name, setting, short_desc FROM pg_settings WHERE name ILIKE ('default%transaction%o%');

Result:

             name              |   setting    |                          short_desc
-------------------------------+--------------+---------------------------------------------------------------
 default_transaction_isolation | serializable | Sets the transaction isolation level of each new transaction.
 default_transaction_read_only | off          | Sets the default read-only status of new transactions.
(2 rows)

So, we're good to go. From terminal_1_> (now SERIALIZABLE) we run:

BEGIN TRANSACTION;
SELECT * FROM t;

Result:

 x |  y
---+-----
 1 | One
 2 | Two
(2 rows)

Now, from terminal_2_> we run our check on active queries again:

SELECT 
       pid, 
       age(clock_timestamp(), query_start), 
       usename, query 
FROM pg_stat_activity 
WHERE datname = 'test' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;

Result:

  pid  |       age       | usename  |      query
-------+-----------------+----------+------------------
 11064 | 00:03:37.694149 | postgres | SELECT * FROM t;
(1 row)

The query in terminal_1_> is sitting there and is now > 3mins old...

From terminal_2_> again, we update the table:

INSERT INTO t VALUES (3, 'Three');

and:

Terminal_2_> SELECT * FROM t;
 x |   y
---+-------
 1 | One
 2 | Two
 3 | Three
(3 rows)

Now we go back to terminal_1_> and run SELECT * FROM t; a second time within the same SERIALIZABLE transaction, and we get:

Terminal_1_> SELECT * FROM t;
 x |  y
---+-----
 1 | One
 2 | Two
(2 rows)

The SERIALIZABLE transaction has the same view of table t that it had at the beginning of the transaction, despite terminal_2_> having INSERTed (and committed) a new record...

I would urge you to experiment with differing setups to get to know your server's capabilities and how you can best exploit them...

In terms of your problem, what this means is that:

Scenario 1 (you keep the defaults on your server - i.e. REPEATABLE READ):

At time t1, you run this:

CREATE TEMPORARY TABLE data_to_move 
SELECT *
FROM api_log
WHERE (datum - UNIX_TIMESTAMP(DATE_ADD(NOW(), INTERVAL -6 MONTH))) < 0 LIMIT 5000;

Say, for example, you SELECT 10 records. It's a log table, so it's constantly being updated...

At time t2, you INSERT into your archive table.

INSERT INTO api_log_archive
SELECT * FROM data_to_move;

At time t3, you DELETE from your api_log

DELETE
FROM api_log
WHERE id in (SELECT id FROM data_to_move);

Now, your process will miss any records that have aged to over 6 months between t1 and t3 - so, fine, you say, I'll pick them up the next time...

Scenario 2 (you run your archiving process with the TRANSACTION ISOLATION LEVEL set to SERIALIZABLE):

For starters, you can cut out the CREATE TEMPORARY TABLE phase as follows:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;       -- in MariaDB, this applies to the next transaction only

START TRANSACTION;

SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH); -- evaluate the cut-off once, so both statements use the same value

INSERT INTO api_log_archive
  SELECT *                  -- all columns, not just id, so the archived rows are complete
  FROM api_log
  WHERE datum < @cutoff;

DELETE FROM api_log
WHERE datum < @cutoff;

COMMIT;

-- clean up - set your transaction defaults back to your preferred default...

The advantages of this are that you don't have the overhead of creating a potentially large TEMPORARY TABLE, you reduce the number of steps from 3 to 2, and your system will be more robust.

  • There is another possibility - you can lock your tables for the duration of your update. However, this means that no writes can occur to the api_log table for the duration of the operation, which is generally considered bad practice if it can be avoided. If you wish to explore this possibility, you might like to ask another question...
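For completeness, a minimal sketch of what that locking approach would look like (again assuming the question's table names) - bear in mind that it blocks all other writes to api_log until UNLOCK TABLES:

```sql
-- Sketch only: LOCK TABLES implicitly commits any open transaction,
-- so this relies on the table locks themselves rather than on a transaction.
LOCK TABLES api_log WRITE, api_log_archive WRITE;

SET @cutoff := UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH);

INSERT INTO api_log_archive SELECT * FROM api_log WHERE datum < @cutoff;
DELETE FROM api_log WHERE datum < @cutoff;

UNLOCK TABLES;
```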

This fiddle should help clarify things - PostgreSQL has the useful (and intelligently explicit) functions TRANSACTION_TIMESTAMP and CLOCK_TIMESTAMP - under READ COMMITTED, each query sees the data as of (roughly) CLOCK_TIMESTAMP, while a SERIALIZABLE transaction sees the data as of TRANSACTION_TIMESTAMP!
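A quick way to see the difference between those two functions for yourself (PostgreSQL):

```sql
BEGIN;

SELECT transaction_timestamp(), clock_timestamp();  -- near-identical at the start

SELECT pg_sleep(2);

SELECT transaction_timestamp(), clock_timestamp();  -- transaction_timestamp() is unchanged;
                                                    -- clock_timestamp() is ~2 seconds later
COMMIT;
```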