MySQL – Delete millions of rows, fast

Tags: delete, mysql, performance, query-performance

I currently have a database with the following structure:

id->varchar(9)

time->datetime

val01 – val20->int(11)

val21 – val40->tinyint(1)

id and time are the primary key.
also I defined an index for time.

+-----+---------------------+--------+--------+--------+...+--------+--------+--------+
| id  | time                | val01  | val02  | val03  |...| val38  | val39  | val40  |
+-----+---------------------+--------+--------+--------+...+--------+--------+--------+
| #01 | 2014-02-26 12:25:00 |   56   |    9   |   10   |...|    0   |    0   |    0   |
| #01 | 2014-02-26 12:26:00 |   14   |   89   |   58   |...|    0   |    1   |    0   |
| #01 | 2014-02-26 12:27:00 |   52   |   91   |   68   |...|    0   |    1   |    1   |
| #02 | 2014-02-26 12:28:00 |   52   |   30   |   73   |...|    0   |    1   |    1   |
 ....................................................................................
| #01 | 2014-02-28 16:34:00 |   32   |   82   |   86   |...|    0   |    1   |    0   |
| #01 | 2014-02-28 16:35:00 |   28   |   14   |   93   |...|    1   |    1   |    1   |
| #02 | 2014-02-28 16:35:00 |   94   |   95   |   49   |...|    0   |    0   |    0   |
| #02 | 2014-02-28 16:36:00 |   76   |   64   |   30   |...|    1   |    0   |    1   |
+-----+---------------------+--------+--------+--------+...+--------+--------+--------+

The problem is that I will get millions of new rows every day.
The table will have about 1.8 billion rows after the service has been running for a few months, and I would like to prevent it from growing even larger.

Therefore I would like to delete some of the old rows, because there is no need for them anymore.

E.g. I would keep only every 5th record that is older than 6 months.
That would be about 10,000,000 rows a day that I could delete.

I tried to achieve this with this query:

SET @x := 0;

DELETE FROM mytable WHERE(id,time) IN(
    SELECT id, time FROM
    (
        SELECT id, time, (@x:=@x+1) AS x FROM mytable
        WHERE time < "2013-08-08 00:00:00" 
        ORDER BY time
    )t
    WHERE x MOD 5 != 0
);

Unfortunately, this query is much too slow.

How could I improve it? Or is there a different approach that is better?

Best Answer

As mustaccio states, partitioning the data may help, though that might not be practical as a quick solution and you'd still have to optimise such statements a bit.

MySQL is said to be particularly inefficient with IN clauses; in this case it may be running that inner query once for every row in mytable, which is not going to be efficient. Better, but still far from optimal: it may be running the inner query once, spooling the results into a temporary table on disk, and then joining on that.

To avoid IN, you can rewrite the pattern:

DELETE FROM mytable1 WHERE value IN (SELECT key FROM mytable2 WHERE <filtering_condition>)

into

DELETE t1 
FROM   mytable1 t1
INNER JOIN
       mytable2 t2 
ON     t2.key = t1.value 
WHERE  <filtering_condition>

(In your case mytable1 and mytable2 are the same table, and the rewrite works just as well as a self-join.)
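Applied to the table in the question, the self-join version might look like the sketch below. One caveat I'd flag: the behaviour of a user variable like @x inside a derived table is not guaranteed by MySQL, so verify on your version that the numbering comes out in time order before relying on it.

```sql
-- Sketch: the original IN(...) statement rewritten as a self-join.
-- The derived table d numbers the old rows in time order; rows whose
-- position is not a multiple of 5 (roughly 80% of them) are deleted.
SET @x := 0;

DELETE t1
FROM   mytable t1
INNER JOIN
(
    SELECT id, time, (@x := @x + 1) AS x
    FROM   mytable
    WHERE  time < '2013-08-08 00:00:00'
    ORDER BY time
) d
ON     d.id = t1.id
AND    d.time = t1.time
WHERE  d.x MOD 5 != 0;
```

This joins on the full (id, time) primary key, so the optimizer can use the PK for the outer lookup instead of re-evaluating a subquery per row.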

I'm not sure how this will interact with the counting-in-a-variable syntax you have there, though (I'm not a MySQL person specifically, and that is not something seen in the other DBs I work with regularly).

If you add an auto-incrementing integer column (I assume your PK here is (id, time), so neither column alone is unique), then simply checking it modulo 5 may be an acceptable approximation of "delete 80% of the old rows evenly", like so:

DELETE t1 
FROM   mytable t1
WHERE  t1.time < "2013-08-08 00:00:00"
AND    t1.counter MOD 5 != 0

Adding that column initially will be a time-consuming process, but maintaining it afterwards should not be a problem: the DB will generate a number for you on each insert (just make sure you don't include it in the column list of any INSERT operation), and you no longer need IN or a JOIN at all. An index over (time, counter) instead of just time may help performance a bit more. I would be tempted to make counter the primary key and (id, time) a separate unique index (as well as an index over time or (time, counter)), but that would depend a lot on your other operations on the data.
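A sketch of the one-off setup (the column name counter and the index name are placeholders; note that MySQL requires an AUTO_INCREMENT column to be indexed, hence the UNIQUE key, and that an ALTER on a table of this size will take a long time and lock or rebuild the table):

```sql
-- One-off schema change: add an auto-incrementing counter column
-- plus the supporting (time, counter) index suggested above.
ALTER TABLE mytable
    ADD COLUMN counter BIGINT NOT NULL AUTO_INCREMENT,
    ADD UNIQUE KEY uk_counter (counter),
    ADD KEY idx_time_counter (time, counter);
```

After that, the periodic cleanup is exactly the two-predicate DELETE shown above, with no subquery involved.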

Of course, once you are altering the table structure like this, do give consideration to the partitioning option too. It will be more complicated, but may have a significant beneficial performance impact elsewhere as well.
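For illustration, a range-partitioning sketch along the lines mustaccio suggests (assumptions: partition names and month boundaries are placeholders, and the partitioning column must be part of every unique key on the table, which your (id, time) PK satisfies):

```sql
-- Sketch: partition the table by month on the time column.
ALTER TABLE mytable
PARTITION BY RANGE COLUMNS (time)
(
    PARTITION p201308 VALUES LESS THAN ('2013-09-01'),
    PARTITION p201309 VALUES LESS THAN ('2013-10-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);

-- Removing an expired month is then a near-instant metadata operation
-- rather than a row-by-row DELETE:
ALTER TABLE mytable DROP PARTITION p201308;
```

The caveat is that DROP PARTITION removes every row in the range, so it only fits the "keep every 5th record" requirement if you first copy the rows you want to keep elsewhere; alternatively, partitioning can simply confine the modulo DELETE's scan to the relevant partitions.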