I'm creating a database that mirrors our production database, but is lighter and anonymised, for local development purposes. To keep enough data for the engineering team to work with while trimming the size, I'm removing all customers whose updated_at date is more than a year old. The rule is simple: keep recent users, bin old or inactive ones.
For this, I've created a stored procedure:
```sql
DELIMITER //

CREATE PROCEDURE delete_old_customers()
BEGIN
  SET @increment = 0;
  customer_loop: LOOP
    DELETE FROM customers
    WHERE id BETWEEN @increment AND @increment + 999
      AND updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

    IF @increment > (SELECT MAX(id) FROM customers) THEN
      LEAVE customer_loop;
    END IF;

    SET @increment = @increment + 1000;
  END LOOP customer_loop;
END //

DELIMITER ;

CALL delete_old_customers();
DROP PROCEDURE delete_old_customers;
```
So this procedure walks the id range in batches of 1,000 and runs until it has passed the highest customer id.
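One variation I've sketched (untested) loops on ROW_COUNT() instead of walking the id range, so it stops as soon as a batch deletes nothing rather than scanning every 1,000-id window up to MAX(id). It assumes an index on updated_at, which this shape of query needs to stay fast:

```sql
DELIMITER //

CREATE PROCEDURE delete_old_customers()
BEGIN
  REPEAT
    -- LIMIT keeps each statement's transaction, undo log and
    -- lock footprint small; no id bookkeeping is needed.
    DELETE FROM customers
    WHERE updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
    LIMIT 1000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END //

DELIMITER ;
```

The advantage is that the number of iterations tracks the number of matching rows, not the size of the id space, so sparse or high id ranges cost nothing.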
I run the procedure like this:
```shell
mysql "$MYSQLOPTS" devdb < ./queries/customer.sql
```
$MYSQLOPTS refers to a my.cnf file with the following options:
```ini
[mysqld]
innodb_buffer_pool_size        = 6G
innodb_log_buffer_size         = 256M
innodb_log_file_size           = 1G
innodb_thread_concurrency      = 0
innodb_write_io_threads        = 64
innodb_flush_log_at_trx_commit = 0
query_cache_size               = 0
```
The problem is that, due to the FKs and other references to this table, the process can take up to 3 hours to remove ~800k customers; and of course, as time goes on, that figure is only going to grow.
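Since the FK checks and cascades seem to be where the time goes, I've wondered whether it would be cheaper to delete the child rows explicitly first, so the parent delete no longer has per-row cascade work to do. A rough, untested sketch, with `orders` standing in as a hypothetical child table (the real schema has several):

```sql
-- Hypothetical child table `orders`, FK orders.customer_id -> customers.id.
-- Remove the children first with a multi-table DELETE...
DELETE o
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

-- ...then the parents, which now have nothing left to cascade into.
DELETE FROM customers
WHERE updated_at < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);
```

This would need one child-side delete per referencing table, in dependency order, so it trades simplicity for fewer per-row FK checks.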
This is running on a quad-core, 8 GB RAM DigitalOcean Droplet, so I have only limited resources to work with.
Given this, I'd like to start optimising this procedure to improve its speed, but I'm unsure where to begin. I'm also open to alternative approaches that achieve the same aim.
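One alternative I've been considering (untested, and only viable because this is a disposable dev clone) is to invert the problem: rather than deleting ~800k rows one by one, copy the keepers into a fresh table and swap it in, turning the whole job into a single bulk insert. A sketch; note that any FK constraints pointing at `customers` would have to be dropped and recreated around the swap, which is the main complication:

```sql
-- Clone the table definition (indexes included, FKs not).
CREATE TABLE customers_new LIKE customers;

-- Bulk-copy only the rows we want to keep.
INSERT INTO customers_new
SELECT * FROM customers
WHERE updated_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

-- Atomically swap the tables, then discard the old data.
RENAME TABLE customers TO customers_old, customers_new TO customers;
DROP TABLE customers_old;
```

Bulk inserts into a fresh table avoid the per-row delete, undo-log and FK-check overhead entirely, which is why this pattern is often faster when most of a table is being removed.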