We have a star schema data warehouse running on MySQL 5.6. We keep a rolling 18 months of data in our fact tables using partitions by month. We have a number of dynamic dimension tables that are referenced by multiple fact tables. However, we have no easy way to remove the rows from dimension tables that are no longer referenced by any fact table. Quick summary looks like this:
dim_url - 1B rows - 360GB
fact_ranks - 2.3B rows - 240GB
fact_seen - 2.8B rows - 295GB
Currently we are attempting to use a combination of Percona Archiver and triggers to generate "used dimension keys" tables, so we can do the process online. We then use the key table to build a new dimension that only has referenced rows. However, we have been unable to complete this process in production and estimate it could take up to a month.
This has to be a common problem with a more elegant solution.
Best Answer
(Normally, I would say "don't bother". But I see that your Dim table is nearly as big as the Fact table. Hmmmm... Perhaps it was unwise to even have the Dim table?? Think about that.)
For building the
Used
table (and assuming a Trigger handles incoming Inserts):From the Fact table, build a list of useful values (
Used
). Do this one chunk at a time, say, 1000 rows, as determined by thePRIMARY KEY
. This list goes into an extra table,Used
. One transaction per chunk. De-dup and add a PK. See my comments on chunking.Now, walk through the Dim table, perhaps only 100 rows at a time, and do a multi-table delete, using
LEFT JOIN Used ... WHERE ... IS NULL
to determine what is not needed.Be cautious -- the
DELETE
and theTRIGGER
may step on each other. That is, there may be deadlocks necessitating re-execution. But 100 rows at a time should keep that at a nice balance between minimizing deadlocks and maximizing speed. One transaction per chunk.Yes, it will take time -- both for the discovery of
Used
and for theDELETEs
. But it should not matter. Everything else is humming along, with only very rare and brief interference.I guess the
TRIGGER
would need to insert intoUsed
when a row is inserted intoDim
.