SQL Server – How to Validate Purge Hasn’t Deleted Wrong Data

sql server

I've just inherited a 17-year-old database with 481 tables, terrible referential integrity, and upwards of 800 GB of data at some customer sites.

We have a process which identifies all parents older than x days, then identifies all the children, grandchildren, great-grandchildren, etc. of those parents, and deletes from the lowest level back up to the parent. We have a QA process for confirming that we are deleting the right quantity of data (using table row counts), but when pressed by the CTO, I couldn't commit to saying that we are sure we are deleting the right data.

The purge scripts comprise over 15,000 lines of SQL spread across 50+ SPROCs, so this isn't something that can be solved by a code review or unit testing.

I'm looking for any advice on processes/practices to utilise as part of our QA to validate that we have not deleted the wrong data. The complexity of our application prevents the answer from being part of application QA, and given there are millions of dollars riding on this data, the margin for error is vanishingly small.

Best Answer

Declaring foreign keys would be a great start, but I think you know that. While they will ensure that what is in the database is consistent, they will not ensure it is correct. In other words, they can't stop you deleting something that should be kept. For that you need to program the "should be kept" rule somewhere. You've put the "should be removed" rule in stored procedures; I think putting the "should be kept" rule in triggers will give an additional level of assurance.
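As a minimal sketch of such a guard trigger — the table `dbo.Orders`, its `CreatedDate` column, and the 90-day retention window are all hypothetical stand-ins for your real schema and rule:

```sql
-- Guard trigger: refuse any DELETE that touches a row inside the
-- (assumed) 90-day retention window on the (assumed) dbo.Orders table.
CREATE TRIGGER dbo.trg_Orders_ProtectRetained
ON dbo.Orders
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- If any deleted row falls inside the retention window,
    -- the purge has strayed: undo the whole statement.
    IF EXISTS (
        SELECT 1
        FROM deleted
        WHERE CreatedDate >= DATEADD(DAY, -90, SYSDATETIME())
    )
    BEGIN
        ROLLBACK TRANSACTION;
        THROW 50001, N'Purge attempted to delete a row inside the retention window.', 1;
    END
END;
```

Because the trigger encodes "should be kept" independently of the SPROCs that encode "should be removed", a bug has to exist in both places before data is lost.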

In past purges we have implemented extensive reconciliation scripts. Counting the rows is a good start. Summing values by multiple combinations of keys is better. SQL Server's GROUPING SETS syntax makes this not much less efficient than a simple count. Perform these aggregates before the purge, with a WHERE clause selecting only the rows to be retained. Perform them again afterwards with no WHERE clause. If too many or too few rows were removed, the before and after numbers will differ. Because you summed by many different keys, it will be easier to identify the offending source rows.
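A sketch of the before/after snapshots — the table `dbo.OrderLines`, its keys `OrderID` and `ProductID`, the `Amount` column, and the 90-day cutoff are assumptions; substitute your real tables and retention rule:

```sql
-- Before the purge: aggregate only the rows that should survive.
SELECT GROUPING_ID(OrderID, ProductID) AS GrpLevel,
       OrderID, ProductID,
       COUNT_BIG(*) AS RowCnt,
       SUM(Amount)  AS AmountSum
INTO   #before
FROM   dbo.OrderLines
WHERE  CreatedDate >= DATEADD(DAY, -90, SYSDATETIME())   -- retained rows only
GROUP BY GROUPING SETS ((OrderID), (ProductID), (OrderID, ProductID), ());

-- ...run the purge here...

-- After the purge: the same aggregates, with no WHERE clause.
SELECT GROUPING_ID(OrderID, ProductID) AS GrpLevel,
       OrderID, ProductID,
       COUNT_BIG(*) AS RowCnt,
       SUM(Amount)  AS AmountSum
INTO   #after
FROM   dbo.OrderLines
GROUP BY GROUPING SETS ((OrderID), (ProductID), (OrderID, ProductID), ());
```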

Given that we accept an error can occur, it is possible the error will delete some number of rows it shouldn't and leave the same number it should have removed, so the counts still match. Unlikely, but possible. With sums, the error has to remove and leave rows whose values exactly balance, which is even less likely. Summing over multiple keys reduces the chances further. This can be continued until an arbitrary level of assurance is achieved.
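Continuing the sketch above, a set comparison makes the check concrete — any group whose aggregates differ between the two snapshots points at the offending keys:

```sql
-- Rows present in one snapshot but not the other, in either direction.
SELECT * FROM (SELECT * FROM #before EXCEPT SELECT * FROM #after) AS missing
UNION ALL
SELECT * FROM (SELECT * FROM #after EXCEPT SELECT * FROM #before) AS extra;
```

Note that EXCEPT treats two NULLs as equal, so the placeholder NULLs that GROUPING SETS produces in the key columns compare cleanly.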

I would not suggest deploying this against production, at least not initially. A restored copy will suffice to perform the cross-checks and to validate the validation code itself. If you can confirm there that the deleting procs are working as intended, you have the confidence to deploy them to production without the extra checks.
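Standing up such a copy is a one-statement restore; the database name, backup path, and logical/physical file names below are placeholders for your environment:

```sql
-- Restore a disposable copy of the latest full backup for purge testing.
RESTORE DATABASE LegacyDb_PurgeTest
FROM DISK = N'D:\Backups\LegacyDb_full.bak'
WITH MOVE N'LegacyDb_Data' TO N'D:\Data\LegacyDb_PurgeTest.mdf',
     MOVE N'LegacyDb_Log'  TO N'D:\Log\LegacyDb_PurgeTest.ldf',
     STATS = 10;
```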