Mysql – Cleanup / Prune Unreferenced Data in Dimension Tables

data-warehouseMySQLpercona-toolkitstar-schema

We have a star schema data warehouse running on MySQL 5.6. We keep a rolling 18 months of data in our fact tables using partitions by month. We have a number of dynamic dimension tables that are referenced by multiple fact tables. However, we have no easy way to remove the rows from dimension tables that are no longer referenced by any fact table. Quick summary looks like this:

dim_url - 1B rows - 360GB
fact_ranks - 2.3B rows - 240GB
fact_seen - 2.8B rows - 295GB

Currently we are attempting to use a combination of Percona Archiver and triggers to generate "used dimension keys" tables, so we can do the process online. We then use the key table to build a new dimension that only has referenced rows. However, we have been unable to complete this process in production and estimate it could take up to a month.

This has to be a common problem with a more elegant solution.

Best Answer

(Normally, I would say "don't bother". But I see that your Dim table is nearly as big as the Fact table. Hmmmm... Perhaps it was unwise to even have the Dim table?? Think about that.)

For building the Used table (and assuming a Trigger handles incoming Inserts):

From the Fact table, build a list of useful values (Used). Do this one chunk at a time, say, 1000 rows, as determined by the PRIMARY KEY. This list goes into an extra table, Used. One transaction per chunk. De-dup and add a PK. See my comments on chunking.

Now, walk through the Dim table, perhaps only 100 rows at a time, and do a multi-table delete, using LEFT JOIN Used ... WHERE ... IS NULL to determine what is not needed.

Be cautious -- the DELETE and the TRIGGER may step on each other. That is, there may be deadlocks necessitating re-execution. But 100 rows at a time should keep that at a nice balance between minimizing deadlocks and maximizing speed. One transaction per chunk.

Yes, it will take time -- both for the discovery of Used and for the DELETEs. But it should not matter. Everything else is humming along, with only very rare and brief interference.

I guess the TRIGGER would need to insert into Used when a row is inserted into Dim.