Sql-server – Purging of unreferenced data via unconditional DELETE, by leaning on foreign key constraints

constraintdatabase-designforeign keysql serversql server 2014

I'm designing a database from which old data has to be purged regularly for legal reasons, and I'm trying to figure out the best way to organise the purging.

Dealing with the main tables is trivial, since they all have a month column and cascading deletes are set up for all owned rows in detail tables. So I can simply iterate over a list of these "master" tables and delete rows older than a given month.

However, there are some tables for which things aren't that simple. They can be referenced from quite a few other tables but their rows have to disappear when they are no longer referenced (because of data protection laws).

I could write some code find out which rows in a certain table aren't referenced anywhere else, based on the foreign key meta data in the database schema.

However, I'd rather lean on the foreign key constraints instead and simply use the moral equivalent of DELETE FROM @TableName. The constraints keep referenced rows from getting deleted and all unreferenced rows disappear as intended. Hence I would make simply make a second list of table names to which the unconditional DELETE should be applied during a purge, and that's it.

That solution would certainly be ideal: you can't make it any simpler, and it is directly based on the declared database schema, i.e. the foreign key constraints.

Would that be considered acceptable practice? Are there any drawbacks*?

*) apart from the fact that the consequences of someone dropping a required foreign key constraint are not only unusually dire but also unusually delayed (until the next purge at the end of the year)

Additional considerations

I found the fly in the ointment: the constraint not only blocks the deletion, it also results in an error state and abortion of the current statement. Hence the scheme doesn't work as is. Back to the drawing board then… Any pointers welcome.

Best Answer

One option would be to create views for these "inconvenient" tables that only return rows where all foreign key relationships have been broken. Then you could do unconditional deletes from those views.

You would have the overhead of coding each of these views*, but that's a one-time startup cost. This is essentially the same as what you mentioned already:

I could write some code find out which rows in a certain table aren't referenced anywhere else

But in the end, you will have the "unconditional deletes" in your cleanup job.


Note: the approach below would not scale well on really large tables without indexes, and potentially some form of batching (e.g., putting a TOP 100 and ORDER BY in the view definition

Say you have two "main tables" that reference this "inconvenient table" (the one you need to remove rows from when they are no longer referenced:

CREATE TABLE dbo.InconvenientTable
(
    Id int IDENTITY(1,1) NOT NULL,
    [Description] varchar(10) NOT NULL,

    CONSTRAINT PK_InconvenientTable 
        PRIMARY KEY (Id)
);
GO

CREATE TABLE dbo.MainTable1
(
    Id int IDENTITY(1,1) NOT NULL,
    [Description] varchar(10) NOT NULL,
    InconvenientTableId int NOT NULL,

    CONSTRAINT FK_MainTable1_InconvenientTable 
        FOREIGN KEY (InconvenientTableId)
        REFERENCES dbo.InconvenientTable (Id)
);

CREATE TABLE dbo.MainTable2
(
    Id int IDENTITY(1,1) NOT NULL,
    [Description] varchar(10) NOT NULL,
    InconvenientTableId int NOT NULL,

    CONSTRAINT FK_MainTable2_InconvenientTable 
        FOREIGN KEY (InconvenientTableId)
        REFERENCES dbo.InconvenientTable (Id)
);

The inconvenient table has 3 rows, all 3 of which are referenced in MainTable1, while only 1 is referenced in MainTable2.

INSERT INTO dbo.InconvenientTable
    ([Description])
VALUES
    ('One'),
    ('Two'),
    ('Three');
GO

INSERT INTO dbo.MainTable1
    ([Description], InconvenientTableId)
VALUES
    ('One', 1),
    ('Two', 2),
    ('Three', 3);

INSERT INTO dbo.MainTable2
    ([Description], InconvenientTableId)
VALUES
    ('Two', 2);

Now we need a view that shows all rows in dbo.InconvenientTable that are not referenced by the two main tables:

CREATE VIEW dbo.InconvenientTable_RowsToDelete
AS
SELECT it.*
FROM dbo.InconvenientTable it
WHERE 
    NOT EXISTS (SELECT NULL FROM dbo.MainTable1 mt1 WHERE mt1.InconvenientTableId = it.Id)
    AND NOT EXISTS (SELECT NULL FROM dbo.MainTable2 mt2 WHERE mt2.InconvenientTableId = it.Id);
GO

Currently all rows are referenced, so this query returns 0 rows:

screenshot of SSMS showing no rows returned by the view

Now let's delete the 'Two' row from both main tables:

DELETE dbo.MainTable2 WHERE InconvenientTableId = 2;
DELETE dbo.MainTable1 WHERE InconvenientTableId = 2;

And now the view returns that unreferenced row:

screenshot of SSMS showing one row

Now we can delete everything from the view, which successfully removes the 'Two' row from our inconvenient table for compliance reasons:

screenshot of SSMS showing 1 row being deleted from the view

*You could also attempt to automate the creation of the views with dynamic SQL and metadata queries, but that seems risky