SQL Server – How to Validate Purge Hasn’t Deleted Wrong Data

sql server

I've just inherited a 17-year-old database with 481 tables, terrible referential integrity, and upwards of 800 GB of data at some customer sites.

We have a process which identifies all parents older than x days, then identifies all the children, grandchildren, great-grandchildren, etc. of those parents, and deletes from the lowest level back up to the parent. We have a QA process for confirming that we are deleting the right quantity of data (using table row counts), but when pressed by the CTO, I couldn't commit to saying that we are sure we are deleting the right data.

The purge scripts comprise over 15,000 lines of SQL spread across 50+ SPROCs, so this isn't something that can be solved by a code review or unit testing.

I'm looking for any advice on processes/practices to utilise as part of our QA to validate that we have not deleted the wrong data. The complexity of our application prevents the answer from being part of application QA, and given there are millions of dollars riding on this data, the margin for error is vanishingly small.

Best Answer

Declaring foreign keys would be a great start, but I think you know that. While they will ensure that what is in the database is consistent, they will not ensure it is correct. In other words, they can't stop you deleting something that should be kept. For that you need to program the "should be kept" rule somewhere. You've put the "should be removed" rule in stored procedures; I think putting the "should be kept" rule in triggers will give an additional level of assurance.
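As a minimal sketch of such a guard trigger — the table `dbo.Orders`, its `CreatedDate` column, and the 90-day retention window are all hypothetical stand-ins for your real schema and rule:

```sql
-- Guard trigger: refuse any DELETE that touches a row inside the
-- (assumed) 90-day retention window on the (assumed) dbo.Orders table.
CREATE TRIGGER dbo.trg_Orders_ProtectRetained
ON dbo.Orders
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- If any deleted row falls inside the retention window,
    -- the purge has strayed: undo the whole statement.
    IF EXISTS (
        SELECT 1
        FROM deleted
        WHERE CreatedDate >= DATEADD(DAY, -90, SYSDATETIME())
    )
    BEGIN
        ROLLBACK TRANSACTION;
        THROW 50001, N'Purge attempted to delete a row inside the retention window.', 1;
    END
END;
```

Because the trigger encodes "should be kept" independently of the SPROCs that encode "should be removed", a bug has to exist in both places before data is lost.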

In past purges we have implemented extensive reconciliation scripts. Counting the rows is a good start. Summing values by multiple combinations of keys is better. SQL Server's GROUPING SETS syntax makes this not much less efficient than a simple count. Perform these aggregates before the purge, with a WHERE clause selecting only the rows to be retained. Perform them again afterwards with no WHERE clause. If too many or too few rows were removed, the before and after numbers will differ. Because you summed by many different keys, it will be easier to identify the offending source rows.
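A sketch of the before/after snapshots — the table `dbo.OrderLines`, its keys `OrderID` and `ProductID`, the `Amount` column, and the 90-day cutoff are assumptions; substitute your real tables and retention rule:

```sql
-- Before the purge: aggregate only the rows that should survive.
SELECT GROUPING_ID(OrderID, ProductID) AS GrpLevel,
       OrderID, ProductID,
       COUNT_BIG(*) AS RowCnt,
       SUM(Amount)  AS AmountSum
INTO   #before
FROM   dbo.OrderLines
WHERE  CreatedDate >= DATEADD(DAY, -90, SYSDATETIME())   -- retained rows only
GROUP BY GROUPING SETS ((OrderID), (ProductID), (OrderID, ProductID), ());

-- ...run the purge here...

-- After the purge: the same aggregates, with no WHERE clause.
SELECT GROUPING_ID(OrderID, ProductID) AS GrpLevel,
       OrderID, ProductID,
       COUNT_BIG(*) AS RowCnt,
       SUM(Amount)  AS AmountSum
INTO   #after
FROM   dbo.OrderLines
GROUP BY GROUPING SETS ((OrderID), (ProductID), (OrderID, ProductID), ());
```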

Given that we accept an error can occur, it is possible the error will delete some number of rows it shouldn't and leave the same number it should have removed, so the counts still match. Unlikely, but possible. With sums, the error has to remove and leave rows whose values exactly balance, which is even less likely. Summing over multiple keys reduces the chances further. This can be continued until an arbitrary level of assurance is achieved.
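Continuing the sketch above, a set comparison makes the check concrete — any group whose aggregates differ between the two snapshots points at the offending keys:

```sql
-- Rows present in one snapshot but not the other, in either direction.
SELECT * FROM (SELECT * FROM #before EXCEPT SELECT * FROM #after) AS missing
UNION ALL
SELECT * FROM (SELECT * FROM #after EXCEPT SELECT * FROM #before) AS extra;
```

Note that EXCEPT treats two NULLs as equal, so the placeholder NULLs that GROUPING SETS produces in the key columns compare cleanly.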

I would not suggest deploying this against production, at least not initially. A restored copy will suffice to perform the cross-checks and to validate the validation code itself. If you can confirm there that the deleting procs are working as intended, you have the confidence to deploy them to production without the extra checks.
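Standing up such a copy is a one-statement restore; the database name, backup path, and logical/physical file names below are placeholders for your environment:

```sql
-- Restore a disposable copy of the latest full backup for purge testing.
RESTORE DATABASE LegacyDb_PurgeTest
FROM DISK = N'D:\Backups\LegacyDb_full.bak'
WITH MOVE N'LegacyDb_Data' TO N'D:\Data\LegacyDb_PurgeTest.mdf',
     MOVE N'LegacyDb_Log'  TO N'D:\Log\LegacyDb_PurgeTest.ldf',
     STATS = 10;
```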