I'm designing a database from which old data has to be purged regularly for legal reasons, and I'm trying to figure out the best way to organise the purging.
Dealing with the main tables is trivial, since they all have a month column and cascading deletes are set up for all owned rows in detail tables. So I can simply iterate over a list of these "master" tables and delete rows older than a given month.
However, there are some tables for which things aren't that simple. They can be referenced from quite a few other tables but their rows have to disappear when they are no longer referenced (because of data protection laws).
I could write some code find out which rows in a certain table aren't referenced anywhere else, based on the foreign key meta data in the database schema.
However, I'd rather lean on the foreign key constraints instead and simply use the moral equivalent of DELETE FROM @TableName
. The constraints keep referenced rows from getting deleted and all unreferenced rows disappear as intended. Hence I would make simply make a second list of table names to which the unconditional DELETE
should be applied during a purge, and that's it.
That solution would certainly be ideal: you can't make it any simpler, and it is directly based on the declared database schema, i.e. the foreign key constraints.
Would that be considered acceptable practice? Are there any drawbacks*?
*) apart from the fact that the consequences of someone dropping a required foreign key constraint are not only unusually dire but also unusually delayed (until the next purge at the end of the year)
Additional considerations
I found the fly in the ointment: the constraint not only blocks the deletion, it also results in an error state and abortion of the current statement. Hence the scheme doesn't work as is. Back to the drawing board then… Any pointers welcome.
Best Answer
One option would be to create views for these "inconvenient" tables that only return rows where all foreign key relationships have been broken. Then you could do unconditional deletes from those views.
You would have the overhead of coding each of these views*, but that's a one-time startup cost. This is essentially the same as what you mentioned already:
But in the end, you will have the "unconditional deletes" in your cleanup job.
Note: the approach below would not scale well on really large tables without indexes, and potentially some form of batching (e.g., putting a TOP 100 and ORDER BY in the view definition
Say you have two "main tables" that reference this "inconvenient table" (the one you need to remove rows from when they are no longer referenced:
The inconvenient table has 3 rows, all 3 of which are referenced in MainTable1, while only 1 is referenced in MainTable2.
Now we need a view that shows all rows in
dbo.InconvenientTable
that are not referenced by the two main tables:Currently all rows are referenced, so this query returns 0 rows:
Now let's delete the 'Two' row from both main tables:
And now the view returns that unreferenced row:
Now we can delete everything from the view, which successfully removes the 'Two' row from our inconvenient table for compliance reasons:
*You could also attempt to automate the creation of the views with dynamic SQL and metadata queries, but that seems risky