MySQL – Delete millions of rows, fast

Tags: delete, mysql, performance, query-performance

I currently have a database with the following structure:

id->varchar(9)

time->datetime

val01 – val20->int(11)

val21 – val40->tinyint(1)

id and time are the primary key.
also I defined an index for time.

+-----+---------------------+--------+--------+--------+...+--------+--------+--------+
| id  | time                | val01  | val02  | val03  |...| val38  | val39  | val40  |
+-----+---------------------+--------+--------+--------+...+--------+--------+--------+
| #01 | 2014-02-26 12:25:00 |   56   |    9   |   10   |...|    0   |    0   |    0   |
| #01 | 2014-02-26 12:26:00 |   14   |   89   |   58   |...|    0   |    1   |    0   |
| #01 | 2014-02-26 12:27:00 |   52   |   91   |   68   |...|    0   |    1   |    1   |
| #02 | 2014-02-26 12:28:00 |   52   |   30   |   73   |...|    0   |    1   |    1   |
 ....................................................................................
| #01 | 2014-02-28 16:34:00 |   32   |   82   |   86   |...|    0   |    1   |    0   |
| #01 | 2014-02-28 16:35:00 |   28   |   14   |   93   |...|    1   |    1   |    1   |
| #02 | 2014-02-28 16:35:00 |   94   |   95   |   49   |...|    0   |    0   |    0   |
| #02 | 2014-02-28 16:36:00 |   76   |   64   |   30   |...|    1   |    0   |    1   |
+-----+---------------------+--------+--------+--------+...+--------+--------+--------+

The problem is that I will get millions of new rows every day.
The table will have about 1.8 billion rows after the service has been running for a few months, and I would like to prevent it from growing even larger.

Therefore I would like to delete some of the old rows, because there is no need for them anymore.

E.g. I would keep only every 5th record that is older than 6 months.
That would be about 10,000,000 rows a day that I could delete.

I tried to achieve this with this query:

SET @x := 0;

DELETE FROM mytable WHERE(id,time) IN(
    SELECT id, time FROM
    (
        SELECT id, time, (@x:=@x+1) AS x FROM mytable
        WHERE time < "2013-08-08 00:00:00" 
        ORDER BY time
    )t
    WHERE x MOD 5 != 0
);

Unfortunately, this query is much too slow.

How could I improve it? Or is there a different approach that is better?

Best Answer

As mustaccio states, partitioning the data may help, though that might not be practical as a quick solution and you'd still have to optimise such statements a bit.

MySQL is said to be particularly inefficient with IN clauses; in this case it may be running that inner query once for every row in mytable, which is not going to be efficient. Better, but still far from optimal: it may be running the inner query once, spooling the results into a temporary table on disk, and then joining on that.

To avoid IN, you can rewrite the pattern:

DELETE FROM mytable1 WHERE value IN (SELECT key FROM mytable2 WHERE <filtering_condition>)

into

DELETE t1 
FROM   mytable1 t1
INNER JOIN
       mytable2 t2 
ON     t2.key = t1.value 
WHERE  <filtering_condition>

(In your case mytable1 and mytable2 are the same table, and the rewrite works just as well as a self-join.)
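Applied to the table in the question, the self-join version might look like the sketch below. One caveat I'd flag: the behaviour of a user variable like @x inside a derived table is not guaranteed by MySQL, so verify on your version that the numbering comes out in time order before relying on it.

```sql
-- Sketch: the original IN(...) statement rewritten as a self-join.
-- The derived table d numbers the old rows in time order; rows whose
-- position is not a multiple of 5 (roughly 80% of them) are deleted.
SET @x := 0;

DELETE t1
FROM   mytable t1
INNER JOIN
(
    SELECT id, time, (@x := @x + 1) AS x
    FROM   mytable
    WHERE  time < '2013-08-08 00:00:00'
    ORDER BY time
) d
ON     d.id = t1.id
AND    d.time = t1.time
WHERE  d.x MOD 5 != 0;
```

This joins on the full (id, time) primary key, so the optimizer can use the PK for the outer lookup instead of re-evaluating a subquery per row.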

I'm not sure how this will interact with the counting-in-a-variable syntax you have there, though (I'm not a MySQL person specifically, and that is not something seen in the other DBs I work with regularly).

If you add an auto-incrementing integer column (I assume your PK here is (id, time), so neither column alone is unique), then simply checking it modulo 5 may be an acceptable approximation of "delete 80% of the old rows evenly", like so:

DELETE t1 
FROM   mytable t1
WHERE  t1.time < "2013-08-08 00:00:00"
AND    t1.counter MOD 5 != 0

Adding that column initially will be a time-consuming process, but maintaining it afterwards should not be a problem: the DB will generate a number for you on each insert (just make sure you don't include it in the column list of any INSERT operation), and you no longer need IN or a JOIN at all. An index over (time, counter) instead of just time may help performance a bit more. I would be tempted to make counter the primary key and (id, time) a separate unique index (as well as an index over time or (time, counter)), but that would depend a lot on your other operations on the data.
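A sketch of the one-off setup (the column name counter and the index name are placeholders; note that MySQL requires an AUTO_INCREMENT column to be indexed, hence the UNIQUE key, and that an ALTER on a table of this size will take a long time and lock or rebuild the table):

```sql
-- One-off schema change: add an auto-incrementing counter column
-- plus the supporting (time, counter) index suggested above.
ALTER TABLE mytable
    ADD COLUMN counter BIGINT NOT NULL AUTO_INCREMENT,
    ADD UNIQUE KEY uk_counter (counter),
    ADD KEY idx_time_counter (time, counter);
```

After that, the periodic cleanup is exactly the two-predicate DELETE shown above, with no subquery involved.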

Of course, once you are altering the table structure like this, do give consideration to the partitioning option too. It will be more complicated, but may have a significant beneficial performance impact elsewhere as well.
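For illustration, a range-partitioning sketch along the lines mustaccio suggests (assumptions: partition names and month boundaries are placeholders, and the partitioning column must be part of every unique key on the table, which your (id, time) PK satisfies):

```sql
-- Sketch: partition the table by month on the time column.
ALTER TABLE mytable
PARTITION BY RANGE COLUMNS (time)
(
    PARTITION p201308 VALUES LESS THAN ('2013-09-01'),
    PARTITION p201309 VALUES LESS THAN ('2013-10-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);

-- Removing an expired month is then a near-instant metadata operation
-- rather than a row-by-row DELETE:
ALTER TABLE mytable DROP PARTITION p201308;
```

The caveat is that DROP PARTITION removes every row in the range, so it only fits the "keep every 5th record" requirement if you first copy the rows you want to keep elsewhere; alternatively, partitioning can simply confine the modulo DELETE's scan to the relevant partitions.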