MySQL bulk deletions


I am currently looking into the most efficient way to delete rows from a table in bulk. The database I'm using is MySQL with InnoDB.

The table I want to delete from can be any size, but will usually be around 10 GB. It has an integer primary key and an indexed, non-unique hash column; I drop the hash index before proceeding with the deletes. The table has no other indexes.

I have a second table containing only the IDs I wish to delete. The number of IDs to delete can be anything from none to everything.

The problem is that I can't decide on a strategy for deleting these rows quickly. I am aware of two, which sit at opposite ends of the spectrum.

Option 1: The standard DELETE FROM table_data WHERE id IN (SELECT id FROM table_todelete). This is good if there are very few deletions.
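As a sketch, option 1 can also be written as a multi-table DELETE with a join, which MySQL often optimizes better than an IN subquery (table and column names are the ones from the question):

```sql
-- Option 1: delete matching rows in place.
-- The join form avoids the IN (SELECT ...) subquery.
DELETE table_data
FROM table_data
JOIN table_todelete ON table_data.id = table_todelete.id;
```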

Option 2: Create a new table, insert the rows from table_data whose id is NOT in table_todelete, then drop table_data and rename the newly created table. This is obviously better when there are a large number of deletes.
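A sketch of option 2, using an anti-join rather than NOT IN (the anti-join sidesteps NOT IN's NULL pitfalls; the `_new`/`_old` names are illustrative):

```sql
-- Option 2: rebuild the table without the rows to be deleted.
CREATE TABLE table_data_new LIKE table_data;

INSERT INTO table_data_new
SELECT d.*
FROM table_data AS d
LEFT JOIN table_todelete AS t ON d.id = t.id
WHERE t.id IS NULL;   -- keep only rows with no match in table_todelete

-- Swap the tables atomically, then discard the old copy.
RENAME TABLE table_data TO table_data_old,
             table_data_new TO table_data;
DROP TABLE table_data_old;
```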

I do not know enough about the relative performance of SELECT, INSERT, and DELETE to compare them.
My question is: since my data isn't skewed towards always having many or few deletes, which approach is better in terms of performance? Is there some theoretical threshold at which I should switch between the two approaches? If there is a third option I might not have considered, please let me know.

Thanks.

Best Answer

You pretty much answered your own question. Creating a new table will be more efficient when you want to delete a large portion of the table. Just counting the number of records to modify will give you a general idea of the impact. However, you also want to account for time spent updating indexes. If you are deleting a lot of records, you may also be concerned about slack space in the table, meaning that each page is left only partially filled. Considering all of this together, if you need to delete more than about 30% of a table, then creating a new table will be faster.
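A rough way to pick a strategy at run time is to measure the delete set against the table before choosing (a sketch; the 30% figure above is a heuristic, not a hard rule):

```sql
-- If delete_fraction exceeds roughly 0.3, prefer the rebuild (option 2);
-- otherwise delete in place (option 1).
SELECT (SELECT COUNT(*) FROM table_todelete) /
       (SELECT COUNT(*) FROM table_data) AS delete_fraction;
```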

On the downside, creating a second copy of the table will use more disk space, even if only temporarily. It can also disrupt database replication, so it is generally not a good option when replication is involved.

If the goal is to minimize impact on the database, then I recommend pt-archiver from the Percona Toolkit. It deletes records in batches and keeps commits to a reasonable size. It's a good option if you want to automate your purge to run on a daily basis.
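A hedged sketch of such a purge, assuming a database named `mydb` on localhost (the host, schema, batch size, and WHERE clause are all placeholders to adapt; the flags used are documented pt-archiver options):

```shell
# Purge rows from table_data whose ids appear in table_todelete,
# deleting 1000 rows per batch and committing after each batch.
pt-archiver \
  --source h=localhost,D=mydb,t=table_data \
  --purge \
  --where 'id IN (SELECT id FROM table_todelete)' \
  --limit 1000 \
  --commit-each
```

Keeping batches small bounds the size of each transaction, which limits lock time and replication lag compared to one giant DELETE.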