MySQL Large DELETE with JOINs

Tags: mysql, amazon-rds, delete

I have a large table players with ~20MM rows which joins to a table stats with ~300MM rows, among others (photos, …). I want to DELETE all records for players that were born_on before 1950. I should mention this is a Rails app, so I haven't properly constrained any of these relations with Foreign Keys (and I have a good number of indices on each of these tables, listed at the bottom).

[Edit] I would also like to delete all related data from stats, photos, etc. All in all, I would like to delete associated rows from about 10 tables, some of which are second-degree relations.

I would expect to delete the players with a query like so [updated]:

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01';

But I'm running into an issue: because it's such a huge delete, that query takes far too long and too many resources to ever complete. And since the DELETE uses a JOIN, MySQL won't let me use LIMIT to break the query into chunks.

Alternatively, I've tried deleting each row separately:

SELECT id FROM `players` WHERE `players`.born_on < '1950-01-01';

Then

DELETE FROM `players` WHERE `players`.id = 5;
DELETE FROM `stats` WHERE `stats`.player_id = 5;
DELETE FROM `photos` WHERE `photos`.player_id = 5;
... repeat

But the amount of parallelization necessary to complete these on, say, 10MM rows also throttles the database to 100% CPU (an m1.xlarge on Amazon RDS), causing complete downtime for what would be several days of queries.

My question is: what is the best way to delete these old rows from the database without incurring significant downtime for my application? Are there settings, techniques, etc. that would make this simple and effective?

Please feel free to ask more questions about configuration, etc. as necessary to solve this problem. Thanks in advance for all of your help!

[Edit]

Schema

Players Table

CREATE TABLE `players` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(255) DEFAULT NULL,
  `middle_name` varchar(255) DEFAULT NULL,
  `last_name` varchar(255) DEFAULT NULL,
  `birth_date` datetime DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `team_id` int(11) DEFAULT NULL,
  `jersey_name` varchar(255) DEFAULT NULL,
  `home_city` varchar(255) DEFAULT NULL,
  `coach_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `players_team_id_last_name` (`team_id`,`last_name`),
  KEY `players_jersey_name` (`jersey_name`),
  KEY `players_home_city` (`home_city`),
  KEY `players_coach_id_index` (`coach_id`)
) ENGINE=InnoDB AUTO_INCREMENT=611 DEFAULT CHARSET=utf8;

Stats Table

CREATE TABLE `stats` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `player_id` int(11) DEFAULT NULL,
  `bucket_id` int(11) DEFAULT NULL,
  `description` varchar(4096) DEFAULT NULL,
  `meta1` longtext,
  `meta2` longtext,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `status` varchar(255) DEFAULT NULL,
  `confidence` float DEFAULT NULL,
  `viewed_at` datetime DEFAULT NULL,
  `view_count` int(11) DEFAULT '0',
  `reported_at` datetime DEFAULT NULL,
  `reserved` tinyint(1) DEFAULT '0',
  `ref_id` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `stats_player_id_bucket_id` (`player_id`,`bucket_id`),
  KEY `stats_ref_id_player_id_bucket_id` (`ref_id`,`player_id`,`bucket_id`),
  KEY `stats_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=10322 DEFAULT CHARSET=utf8;

Photos Table

CREATE TABLE `photos` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `player_id` int(11) DEFAULT NULL,
  `image` text,
  `source` varchar(255) DEFAULT NULL,
  `content_type` varchar(255) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `photo_type` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `lat` decimal(15,10) DEFAULT NULL,
  `lon` decimal(15,10) DEFAULT NULL,
  `caption` varchar(255) DEFAULT NULL,
  `place_name` varchar(255) DEFAULT NULL,
  `href` varchar(255) DEFAULT NULL,
  `posted_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `photos_player_id` (`player_id`)
) ENGINE=InnoDB AUTO_INCREMENT=313 DEFAULT CHARSET=utf8;

Best Answer

In MySQL there is a multi-table DELETE syntax. Your first DELETE would delete rows only from the players table. To delete from multiple tables you have to use something like:

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01';

This doesn't address the problem with the long-running DELETE statement, though. In fact, the above query should take even more time, because now it actually deletes rows from the stats and photos tables as well. The workaround is to split the large DELETE into smaller ones. Since you have a convenient WHERE condition, you can manually split the deletes on it (for example, one DELETE for each decade of players.born_on) and run them in ascending order, that is:

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1930-01-01';

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1940-01-01';

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01';

If this is too coarse (i.e. each query takes too long to execute), you should make the WHERE conditions even more fine-grained (perhaps deleting one year per chunk).
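The per-year cutoffs can be generated rather than written by hand. A minimal sketch (plain Python; it only produces the cutoff dates, the DELETE template itself is the one shown above):

```python
# Sketch: generate one cutoff date per year, oldest first, so each
# chunked DELETE only touches rows the previous chunk did not remove.

def yearly_cutoffs(start_year, stop_year):
    """Yield 'YYYY-01-01' cutoffs in ascending order, inclusive."""
    for year in range(start_year, stop_year + 1):
        yield f"{year}-01-01"

# Each cutoff would be plugged into:
#   ... WHERE `players`.born_on < '<cutoff>';
# and the statements run one at a time, committing between them.
```

Running the chunks in ascending order is what keeps each statement small: every DELETE only sees the one extra year of rows.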

There is also a --purge option for pt-archiver from Percona Toolkit, which splits the data to be deleted into chunks automatically, but it doesn't seem to support the multi-table case. See the example usage of pt-archiver in this presentation.
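For the single-table case, a pt-archiver invocation might look like this (a sketch only — the host, database, and credentials here are placeholders for your RDS endpoint):

```shell
# Sketch: purge matching players rows in 1000-row chunks.
# --purge deletes rows instead of archiving them elsewhere;
# --commit-each commits after every chunk, keeping transactions short.
pt-archiver \
  --source h=localhost,D=mydb,t=players \
  --where "born_on < '1950-01-01'" \
  --purge --limit 1000 --commit-each
```

Since it doesn't handle the multi-table join, the child rows in stats and photos would still need their own pass each.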

From your table definitions I see that you don't have an index on players.birth_date (I assume this is the column you refer to as born_on in your example queries). That makes the decade-chunk approach useless, since every query would have to scan the whole players table.

If you can't afford a long table lock for the DELETE to finish, you most likely can't afford to create an index on the birth_date column either.

You could split the data on another column instead; the PRIMARY KEY is a good bet. You can write a script that processes all the players in chunks of 10,000 (fewer or more, depending on how long a single DELETE statement takes):

DELETE FROM `players`,`stats`,`photos`
USING `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01'
  AND `players`.id BETWEEN n*10000+1 AND (n+1)*10000;

Here n is a parameter ranging from 0 to MAX(players.id)/10000. This way you avoid a full table scan (which would certainly be painful on tables this size).
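Such a script could be sketched like this (it only generates the statements; how you execute them — one transaction per chunk, ideally with a pause between chunks — is up to your driver):

```python
# Sketch: generate the chunked multi-table DELETEs over players.id.
# max_id would come from SELECT MAX(id) FROM players; each statement
# should run in its own transaction so locks stay short.

def chunked_deletes(max_id, chunk_size=10000, cutoff="1950-01-01"):
    """Yield one DELETE statement per primary-key range."""
    n = 0
    while n * chunk_size < max_id:
        lo = n * chunk_size + 1
        hi = (n + 1) * chunk_size
        yield (
            "DELETE FROM `players`,`stats`,`photos` "
            "USING `players` "
            "LEFT JOIN `stats` ON `players`.id = `stats`.player_id "
            "LEFT JOIN `photos` ON `players`.id = `photos`.player_id "
            f"WHERE `players`.born_on < '{cutoff}' "
            f"AND `players`.id BETWEEN {lo} AND {hi};"
        )
        n += 1
```

Because each chunk is bounded by the PRIMARY KEY, every statement does a range scan on players.id rather than a full table scan, regardless of the missing birth_date index.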

You could also estimate the DELETE's cost beforehand by running EXPLAIN on the equivalent SELECT instead of the DELETE:

EXPLAIN SELECT *
FROM `players`
  LEFT JOIN `stats`
    ON `players`.id = `stats`.player_id
  LEFT JOIN `photos`
    ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01'
  AND `players`.id BETWEEN 1 AND 10000;