In MySQL there is a multi-table DELETE syntax. Your first DELETE will delete rows only from the players table. If you want to delete from multiple tables you have to use something like:
DELETE FROM "players","stats","photos"
USING "players"
LEFT JOIN "stats"
ON "players".id = "stats".player_id
LEFT JOIN "photos"
ON "players".id = "photos".player_id
WHERE "players".born_on < "1950-01-01"
This doesn't address the problem of the long-running DELETE statement, though. In fact the above query will likely take even longer, because now it actually deletes rows from the stats and photos tables as well. The workaround you could use is to split the large DELETE into smaller ones. Since you have a convenient WHERE condition, you can split the deletes on it manually (for example one DELETE for each ten years of players.born_on) and run them in ascending order, that is:
DELETE FROM "players","stats","photos"
USING "players"
LEFT JOIN "stats"
ON "players".id = "stats".player_id
LEFT JOIN "photos"
ON "players".id = "photos".player_id
WHERE "players".born_on < "1930-01-01";
DELETE FROM "players","stats","photos"
USING "players"
LEFT JOIN "stats"
ON "players".id = "stats".player_id
LEFT JOIN "photos"
ON "players".id = "photos".player_id
WHERE "players".born_on < "1940-01-01";
DELETE FROM "players","stats","photos"
USING "players"
LEFT JOIN "stats"
ON "players".id = "stats".player_id
LEFT JOIN "photos"
ON "players".id = "photos".player_id
WHERE "players".born_on < "1950-01-01";
If this is too coarse (i.e. each query takes too long to execute), make the WHERE conditions even more fine-grained, perhaps deleting one year per chunk.
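For instance, a single one-year chunk would look like this (the same join as above, just with a narrower range on born_on):

DELETE FROM `players`, `stats`, `photos`
USING `players`
LEFT JOIN `stats`
  ON `players`.id = `stats`.player_id
LEFT JOIN `photos`
  ON `players`.id = `photos`.player_id
WHERE `players`.born_on >= '1949-01-01'
  AND `players`.born_on < '1950-01-01';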
There is also a --purge option for pt-archiver from Percona Toolkit, which splits the data to be deleted into chunks automatically, but it doesn't seem to support the multi-table case. See example usage of pt-archiver in this presentation.
From your table definitions I see that you don't have an index on players.birth_date (I suppose this is the column you refer to as born_on in your example queries). This makes the decade-chunk approach useless, since every query would have to scan the whole players table. If you can't afford a long table lock while the DELETE finishes, you most likely can't afford to create an index on the birth_date column either.
You could split the data on another column instead; the PRIMARY KEY is a good bet. You can write a script which processes all the players in chunks of 10000 (more or fewer, depending on how long a single DELETE statement takes):
DELETE FROM "players","stats","photos"
USING "players"
LEFT JOIN "stats"
ON "players".id = "stats".player_id
LEFT JOIN "photos"
ON "players".id = "photos".player_id
WHERE "players".born_on < "1950-01-01"
AND "players".id BETWEEN n*10000+1 AND (n+1)*10000
where n is a parameter ranging from 0 to MAX(players.id)/10000. This way you avoid a full table scan (which is certainly painful on a table with 100M rows).
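A minimal sketch of such a script as a MySQL stored procedure (the procedure name and the 10000 chunk size are arbitrary; the tables and columns are the ones from the queries above):

DELIMITER //
CREATE PROCEDURE purge_old_players()
BEGIN
  DECLARE n INT DEFAULT 0;
  DECLARE max_n INT;
  -- highest chunk number that can still contain rows
  SELECT CEIL(MAX(id) / 10000) INTO max_n FROM `players`;
  WHILE n <= max_n DO
    DELETE FROM `players`, `stats`, `photos`
    USING `players`
    LEFT JOIN `stats`
      ON `players`.id = `stats`.player_id
    LEFT JOIN `photos`
      ON `players`.id = `photos`.player_id
    WHERE `players`.born_on < '1950-01-01'
      AND `players`.id BETWEEN n*10000+1 AND (n+1)*10000;
    SET n = n + 1;
  END WHILE;
END //
DELIMITER ;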
You can also estimate the complexity of each DELETE by running EXPLAIN on the equivalent SELECT (older MySQL versions cannot EXPLAIN a DELETE directly):
EXPLAIN SELECT *
FROM `players`
LEFT JOIN `stats`
  ON `players`.id = `stats`.player_id
LEFT JOIN `photos`
  ON `players`.id = `photos`.player_id
WHERE `players`.born_on < '1950-01-01'
  AND `players`.id BETWEEN 1 AND 10000;
If the two families of data are related then I would generally recommend keeping them in the same database - it makes maintaining integrity easier; for instance, taking a reliable and consistent backup is (depending on the database engine, of course) a single operation.
Of course, if the two families of data are only loosely coupled, this need not be a concern, and using different databases or even completely different technologies for each becomes more likely to be a good solution, particularly as you scale out (your first hardware scaling step being to put each family of data on its own (virtual) machine). Remember, though, that if you do split the data over more than one storage technology you are increasing the amount of knowledge needed to maintain the system.
In MSSQL or similar you could break data with significantly different I/O patterns into separate filegroups in the same database and store them on different drives if needed for performance - this way all the data can be dealt with atomically for transaction or backup purposes while still being segregated for performance reasons.
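As a rough T-SQL sketch of that idea (the database, file, and table names here are purely illustrative):

-- add a filegroup and back it with a file on a separate drive
ALTER DATABASE SalesDb ADD FILEGROUP fg_history;

ALTER DATABASE SalesDb
ADD FILE (
    NAME = 'SalesDb_history',
    FILENAME = 'E:\data\SalesDb_history.ndf'
) TO FILEGROUP fg_history;

-- place the table with the different I/O pattern on that filegroup
CREATE TABLE dbo.order_history (
    order_id   INT           NOT NULL PRIMARY KEY,
    order_date DATETIME2     NOT NULL,
    details    NVARCHAR(MAX) NULL
) ON fg_history;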
Best Answer
Yes, the primary key can be any column, but I think you'd be better off using an ID column for the primary key to ensure it's unique (which is technically impossible to guarantee with a timestamp-only primary key). As ypercube suggests, a PK of (timestamp, id), using milliseconds on the time value, may also be a good option when using InnoDB and version 5.6+. You can still create an index on the datetime column, which will give similar performance, though if you're pulling back all rows all the time it's hardly an issue to begin with.
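A minimal sketch of that kind of table definition, assuming a hypothetical readings table (the table and column names are illustrative only):

CREATE TABLE readings (
    recorded_at DATETIME(3)     NOT NULL,   -- fractional seconds require MySQL 5.6+
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    value       DECIMAL(10,2)   NOT NULL,
    PRIMARY KEY (recorded_at, id),          -- composite PK: timestamp first, id breaks ties
    KEY idx_id (id)                         -- InnoDB needs the AUTO_INCREMENT column to lead an index
) ENGINE=InnoDB;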
You can use a stored procedure to handle multiple pieces of logic in a single call, along the lines of the sketch below.
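A minimal sketch, reusing the hypothetical readings table above plus an equally hypothetical daily_totals summary table (assumed to have day as its primary key); the two "pieces of logic" here are only placeholders for whatever your actual logic is:

DELIMITER //
CREATE PROCEDURE add_reading(IN p_recorded_at DATETIME(3), IN p_value DECIMAL(10,2))
BEGIN
  -- piece of logic #1: store the raw row
  INSERT INTO readings (recorded_at, value)
  VALUES (p_recorded_at, p_value);

  -- piece of logic #2: maintain a per-day running total
  INSERT INTO daily_totals (day, total)
  VALUES (DATE(p_recorded_at), p_value)
  ON DUPLICATE KEY UPDATE total = total + p_value;
END //
DELIMITER ;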
Or you can use a trigger with similar logic, which fires automatically for every insert on the table; a sketch follows.
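The trigger variant of the same placeholder logic might look like this (again with the hypothetical tables from above):

DELIMITER //
CREATE TRIGGER readings_after_insert
AFTER INSERT ON readings
FOR EACH ROW
BEGIN
  -- same bookkeeping as the procedure, but it runs automatically on every insert
  INSERT INTO daily_totals (day, total)
  VALUES (DATE(NEW.recorded_at), NEW.value)
  ON DUPLICATE KEY UPDATE total = total + NEW.value;
END //
DELIMITER ;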
100,000 rows isn't that many for a relational database, so retrieval time shouldn't be an issue.
Again, it's a small size so pretty much anything works. MySQL seems like an obvious choice if you have previous experience with it.