I'm creating a database that mirrors our production database, but is lighter and anonymised, for local development purposes. To ensure we have enough data to operate on as an engineering team, I'm removing all customers whose updated_at
date is over a year old. The simple rule is to keep new users and bin old or inactive ones.
For this, I've created a stored procedure.
DELIMITER //
CREATE PROCEDURE delete_old_customers()
BEGIN
    -- Capture the upper bound once. Evaluating SELECT MAX(id) inside the
    -- loop returns NULL once the table is empty, and @increment > NULL is
    -- never true, which would leave the loop spinning forever.
    SET @max_id = (SELECT COALESCE(MAX(id), 0) FROM customers);
    SET @increment = 0;
    customer_loop: LOOP
        -- Delete in batches of 1000 ids to keep each transaction small.
        DELETE FROM customers
        WHERE id BETWEEN @increment AND @increment + 999
          AND updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);
        IF @increment > @max_id THEN
            LEAVE customer_loop;
        END IF;
        SET @increment = @increment + 1000;
    END LOOP customer_loop;
END //
DELIMITER ;
CALL delete_old_customers();
DROP PROCEDURE delete_old_customers;
So this procedure batches the removal into groups of 1000, and runs until there are no more customers to process.
I run the procedure like this:
mysql "$MYSQLOPTS" devdb < ./queries/customer.sql
where $MYSQLOPTS refers to a my.cnf file with the following options:
[mysqld]
innodb_buffer_pool_size = 6G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_thread_concurrency = 0
innodb_write_io_threads = 64
innodb_flush_log_at_trx_commit = 0
query_cache_size = 0
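For context, a minimal sketch of how $MYSQLOPTS might be defined is below; the --defaults-extra-file flag and the file path are assumptions, not taken from the question:

```shell
#!/bin/sh
# Hypothetical definition of MYSQLOPTS: point the client at the options
# file shown above, in addition to the usual defaults. Adjust the path
# to wherever your my.cnf actually lives.
MYSQLOPTS="--defaults-extra-file=./my.cnf"

# Run the cleanup script against the dev database, but only if the
# mysql client is actually installed on this machine.
if command -v mysql >/dev/null 2>&1; then
    mysql "$MYSQLOPTS" devdb < ./queries/customer.sql
fi
```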
The problem is that, because of the foreign keys and references this table has, the process can take up to 3 hours to remove ~800k users; and of course, as time goes on, this will only grow.
This is running on a quad-core, 8GB RAM Digital Ocean Droplet, so I have only limited resources to work with.
Given this, I'd like to start optimising this procedure to improve its speed, but I'm unsure where to begin. I'm also open to alternative methods that achieve the same aim.
Best Answer
I prefer the following strategy: the stored routine that fills the table also deletes a few expired records on each inserted one. It looks like this:
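The answer's original code block is not reproduced here; a minimal sketch of the idea, using a hypothetical insert_customer routine against the question's customers table (the column names are assumptions), might be:

```sql
DELIMITER //
-- Hypothetical insert routine: each call stores one new customer and
-- reclaims up to three expired ones, giving the 1:3 insertion/deletion
-- ratio described below.
CREATE PROCEDURE insert_customer(IN p_name VARCHAR(255))
BEGIN
    INSERT INTO customers (name, updated_at)
    VALUES (p_name, NOW());

    -- Piggy-back a tiny cleanup on every insert; LIMIT keeps the
    -- per-call overhead and lock time negligible.
    DELETE FROM customers
    WHERE updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
    LIMIT 3;
END //
DELIMITER ;
```

Because the deletion rides along with normal write traffic, the table is kept trimmed continuously instead of in one large, lock-heavy pass.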
The insertion/deletion ratio is set to 1:3 to ensure a reasonable removal rate even when the incoming data rate drops due to daily, weekly, or monthly oscillations. This is acceptable for established databases with a low count of expired records. If you want to perform an initial cleanup, you have to set the LIMIT to a value that does not hurt your server's performance. If your incoming data rate is low, you can instead create a special ad-hoc routine:
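Again the answer's code is not shown; a sketch of such an ad-hoc routine, under the assumption that it simply repeats small LIMITed deletes until nothing is left (the procedure name and batch size are hypothetical), could be:

```sql
DELIMITER //
-- Hypothetical one-off cleanup: repeat small LIMITed deletes until no
-- expired rows remain, so no single statement holds locks for long.
CREATE PROCEDURE purge_expired_customers()
BEGIN
    REPEAT
        DELETE FROM customers
        WHERE updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
        LIMIT 1000;
    -- ROW_COUNT() reports how many rows the last DELETE removed;
    -- stop once a pass deletes nothing.
    UNTIL ROW_COUNT() = 0 END REPEAT;
END //
DELIMITER ;
CALL purge_expired_customers();
DROP PROCEDURE purge_expired_customers;
```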
The huge DELETE is split into a series of small ones that won't lock the tables for long.