MySQL – Optimise Large Record Deletion Procedure

mysql, optimization, performance, query-performance, stored-procedures

I'm creating a database that mirrors our production database, but is lighter and anonymised – for local development purposes.

To ensure the engineering team still has enough data to work with, I'm removing all customers whose updated_at date is over a year old. The simple rule is: keep recent users, bin old or inactive ones.

For this, I've created a stored procedure.

DELIMITER //
CREATE PROCEDURE delete_old_customers()
BEGIN
    DECLARE max_id INT DEFAULT 0;
    DECLARE batch_start INT DEFAULT 0;

    -- Capture the upper bound once: re-reading MAX(id) on every iteration
    -- is wasted work, and once the table is empty MAX(id) is NULL, which
    -- makes the comparison NULL and the loop never terminates.
    SELECT COALESCE(MAX(id), 0) INTO max_id FROM customers;

    customer_loop: LOOP

        DELETE FROM customers
        WHERE id BETWEEN batch_start AND batch_start + 999
        AND updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

        IF batch_start > max_id THEN
            LEAVE customer_loop;
        END IF;

        SET batch_start = batch_start + 1000;

    END LOOP customer_loop;
END //
DELIMITER ;

CALL delete_old_customers();

DROP PROCEDURE delete_old_customers;

So this procedure batches the removal into groups of 1000, and runs until there are no more customers to process.

I run the procedure like this:

mysql "$MYSQLOPTS" devdb < ./queries/customer.sql

where $MYSQLOPTS points to a my.cnf file with the following options:

[mysqld]
innodb_buffer_pool_size = 6G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_thread_concurrency = 0
innodb_write_io_threads = 64
innodb_flush_log_at_trx_commit = 0
query_cache_size = 0
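One thing worth noting about this setup: the innodb_* settings above sit in the [mysqld] group, which is read by the server at startup, not by the mysql client. The client only reads groups such as [client] and [mysql], so a file passed to it on the command line can carry connection options only; a minimal sketch (hypothetical credentials):

```ini
# Read by the mysql client. Server tuning (innodb_*) must go in the
# server's own my.cnf and generally needs a server restart to take effect.
[client]
user = dev
password = secret
host = 127.0.0.1
```

invoked as `mysql --defaults-extra-file=./client.cnf devdb < ./queries/customer.sql`.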

The problem is that, due to the foreign keys and references on this table, this process can take up to 3 hours to remove ~800k users; and of course, as time goes on, this will only grow.
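A good chunk of that time likely goes into the foreign-key work itself: every row deleted from customers triggers a check (or cascade) in each table that references it. It can help to see exactly which tables those are and what their delete rule is, and to purge child rows first in their own batches so the parent DELETE has less to do. A sketch, where orders and customer_id are hypothetical names standing in for a real child table:

```sql
-- Which tables reference customers, and what happens on delete:
SELECT CONSTRAINT_NAME, TABLE_NAME, DELETE_RULE
FROM information_schema.REFERENTIAL_CONSTRAINTS
WHERE REFERENCED_TABLE_NAME = 'customers';

-- Hypothetical child table: purging its rows first, in the same batched
-- style, spares each parent DELETE from cascading row by row.
DELETE FROM orders
WHERE customer_id IN (
    SELECT id FROM customers
    WHERE updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
)
LIMIT 1000;
```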

This is running on a quad-core, 8GB RAM, Digital Ocean Droplet, so I have limited resources to work with.

So given this, I'd love the opportunity to begin optimising this procedure to improve its speed, but I'm unsure of where to start. I'm also open to alternative methods to achieve the same aim.
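For context, one alternative I've seen suggested for cases where most rows are being removed: copy the keepers into a fresh table and swap it in, rather than deleting in place. A rough sketch, assuming the table can be taken offline briefly and that foreign keys from child tables are dropped and recreated around the swap (RENAME TABLE does not work cleanly while other tables still reference customers):

```sql
CREATE TABLE customers_new LIKE customers;

-- Copy only the rows worth keeping (recent customers).
INSERT INTO customers_new
SELECT * FROM customers
WHERE updated_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

-- Swap the tables, then discard the old data in one cheap DROP.
RENAME TABLE customers TO customers_old, customers_new TO customers;
DROP TABLE customers_old;
```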

Best Answer

I prefer the following strategy: the stored routine that fills the table also deletes a few expired records with each insert. It looks like this:

BEGIN
-- lot of code --
INSERT INTO `table` ...
-- lot of code --
DELETE FROM `table` AS w WHERE w.expire < NOW() LIMIT 3;
END

The insert-to-delete ratio is set to 1:3 to ensure a reasonable removal rate even when the incoming data rate drops due to daily/weekly/monthly oscillations. This is acceptable for established databases with a low count of expired records. If you want to perform an initial cleanup, set the LIMIT to a value that does not hurt your server's performance.

If your incoming data rate is low, you can instead create a special ad-hoc routine:

DELIMITER //
CREATE PROCEDURE table_cleanup()
BEGIN
main: REPEAT
    DELETE FROM `table` AS w WHERE w.expire < NOW() LIMIT 1000;
UNTIL ROW_COUNT() = 0 END REPEAT main;
END //
DELIMITER ;

The huge DELETE is thus split into a series of small ones, so no single statement locks the table for long.
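If the cleanup should keep running unattended, a routine like the one above can also be driven by MySQL's event scheduler instead of being called by hand. A sketch, assuming you have the EVENT privilege and the scheduler is enabled:

```sql
-- The scheduler is off by default on many installations.
SET GLOBAL event_scheduler = ON;

-- Run the cleanup procedure on a fixed schedule.
CREATE EVENT IF NOT EXISTS run_table_cleanup
ON SCHEDULE EVERY 1 HOUR
DO CALL table_cleanup();
```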