Soft-deleting change-tracked user content

database-designschema

To begin with, I distinguish between tables containing user-generated content and tables containing information about the user. The former are associated with the user profile, whereas the latter are directly associated with the user. In particular, foreign keys referencing the user profile are set to cascade deletions. Thus, user-generated content is deleted along with the profile, whereas information about the user is retained.

Furthermore, any modification to the content tables is kept track of in separate "change tables". This is accomplished using triggers. After inserting user content, the newly created entry is copied to the corresponding change table. When the entry is updated, only the modified fields are stored in the change table while the rest is left empty. The entry in the change table also references the content entry whose change it tracks as well as the user who authored the change (see figure 1). When a content entry is deleted, the foreign keys referencing that entry in the change table are set to null (for example the user_content_id field in the user_content_change table).

Example Database Schema
Figure 1: Example Database Schema.

Now this is what I am not happy with at all. Consider the following scenario: a user creates a ticket to report an inappropriate comment. The ticket references the comment in question. However, the comment author then decides to delete his or her profile and hence the comment. Although the comment text is still stored in the change table, the reference to the comment is lost, making it hard to review the ticket. (It should be noted that this is a rather typical use case for the change tables. Most of them are queried very occasionally.)

So the alternative I came up with are traditional soft-deletes. By this I mean that the change tracking remains the same but the content tables are each equipped with a deleted_at column. When querying user content, the results where this column is null are filtered out. Since no entry is ever physically deleted, the change tables don't lose their references to the content tables. Moreover, I can merge the user_profile table back into the user table, as it only served as an anchor point for cascading deletions. Regarding performance, I think that I can partition the table on the deleted_at column, so that the table indices aren't cluttered with deleted entries.

Nevertheless, I am also unhappy with this approach. First, it makes querying much more cumbersome and prone to error because one always has to exclude the already deleted entries. Secondly, the convenience of cascading deletions is lost. This applies above all to content that has a lot of metadata and derived content attached to it. An adequate delete trigger may have to update the deleted_at column of 10 to 15 tables — not to speak of a user profile being deleted!

Could you provide some guidance on how to improve this design, or maybe suggest a different approach? Please note that this is the first time I am confronted with designing a schema at a large scale.

Best Answer

Alright, so my solution is to cascade soft-deletions in the after update triggers. Here's an example:

create trigger `tr_user_content_au` after update
    on `user_content` for each row
begin
    if new.`deleted_at` <> null then
        -- The row has been soft-deleted. We assume that no
        -- other fields have been modified.

        -- Soft-delete related rows.
        update `user_content_derived` set `deleted_at` = now();
    else if
        -- Check if any other fields have been modified.
        old.`col_1` <> new.`col_1` or
        old.`col_2` <> new.`col_2` then
        -- Copy modified fields to the change table.
        insert into `user_content_change` (
            `user_content_id`,
            `col_1`,
            `col_2`
        )
        values (
            new.`id`,
            if (old.`col_1` = new.`col_1`, null, old.`col_1`),
            if (old.`col_2` = new.`col_2`, null, old.`col_2`)
        );
    end if;
end;

Also, I use views to filter out already deleted rows:

create view `vw_user_content_visible` as
    select
        `col_1`,
        `col_2`
      from `user_content`
     where `deleted_at` = null;

And finally, I turn even the simplest insert, select or update statements into stored procedures. That's how I mask whether an entry is physically deleted or just soft-deleted. Apart from this, I suppose that people working on application code are less likely to forget about soft-deletions if they are forced to look into the database code to alter queries.