Designing SQL Schema for Historical Data as First-Class Citizen

database-designdatabase-recommendationsql server

I'm currently designing a SQL Server database schema for an application that consists of highly structured "tree" data. So, a table might look like this:

id
parentId (references id in same table)
text

This is simple enough, but I also need to be able to update either the parentId of a node or the text of a node (or delete nodes), and access each revision of the tree at the same speed that the current revision can be retrieved. My current solution is to have two tables. First a revision table:

id
revisionDate

And then a modified node table:

id
UID (same for all revisions of a specific node)
revisionId (references id of revision table)
deleted (flag for deleted node)
parentUID (references UID of parent node)
text

So to get the tree for a specific revision, you take the revisionId and all revisionIds before it, then query the node table for all nodes with those revisionIds, then take only the highest revisionId nodes for nodes that share a UID, then delete any node that has the deleted flag set.

This works, but gets ugly fast (I'm using a simplified example) I actually have about 10 tables like this that need revision tracking, with each table having cross references to each other. This approach also breaks foreign keys, since UIDs are not unique. I cannot simply copy the full tree data for each revision, since there may be hundreds of revisions per hour.

What would the best practice be for a problem like this? I'd be willing to use a nonrelational database if it would be clearly superior.

Best Answer

Trying to do FKs will just frustrate you - you don't actually need it.

If you handle versioning by using an InsertDateTime concept, then you're basically describing a Type2 dimension as used by many data warehouses. There is quite a lot of material out there about tuning systems that sit over the top of data warehouses, but from a T-SQL perspective, consider using APPLY. Like this:

SELECT ...
FROM ... AS t
CROSS APPLY (
    SELECT TOP (1) ...
    FROM ... AS rev
    WHERE rev.UID = t.UID
    AND rev.InsertDateTime < @revdatetime
    ORDER BY rev.InsertDateTime DESC) AS something
WHERE something.IsDeleted = 'N'

This kind of construct benefits from an index on (IsDeleted, UID, InsertDateTime DESC).

Alternatively, if you want to look at the whole table as at a particular time, use ROW_NUMBER() like:

WITH numbered AS
 (  SELECT *, ROW_NUMBER() OVER (PARTITION BY UID ORDER BY InsertDateTime DESC) AS rownum
    FROM ...
    WHERE InsertDateTime < @revDateTime
),
VersionAtTime AS
(   SELECT *
    FROM numbered
    WHERE rownum = 1
    AND IsDeleted = 'N'
)
SELECT ...
FROM VersionAtTime
....

(edited from original to move the IsDeleted predicates outside the sub-queries, removing all versions at that particular time)