Designing SQL Schema for Historical Data as First-Class Citizen

database-design, database-recommendation, sql-server

I'm currently designing a SQL Server database schema for an application that consists of highly structured "tree" data. So, a table might look like this:

id
parentId (references id in same table)
text

This is simple enough, but I also need to be able to update either the parentId of a node or the text of a node (or delete nodes), and access each revision of the tree at the same speed that the current revision can be retrieved. My current solution is to have two tables. First a revision table:

id
revisionDate

And then a modified node table:

id
UID (same for all revisions of a specific node)
revisionId (references id of revision table)
deleted (flag for deleted node)
parentUID (references UID of parent node)
text

So to get the tree for a specific revision, you take the revisionId and all revisionIds before it, then query the node table for all nodes with those revisionIds, then take only the highest revisionId nodes for nodes that share a UID, then delete any node that has the deleted flag set.
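
As a rough sketch of the lookup described above (not the asker's actual code), the "highest revisionId per UID, then drop deleted nodes" step can be written with ROW_NUMBER(). The table variable @Node, the sample rows, and the variable @revisionId are assumptions made for illustration; the column names follow the node table listed earlier.

```sql
-- Illustrative stand-in for the node table (hypothetical data).
DECLARE @Node TABLE (
    id         int IDENTITY(1,1) PRIMARY KEY,
    UID        int NOT NULL,           -- same for all revisions of a node
    revisionId int NOT NULL,           -- would reference the revision table
    deleted    bit NOT NULL,
    parentUID  int NULL,
    [text]     nvarchar(200) NOT NULL
);

INSERT INTO @Node (UID, revisionId, deleted, parentUID, [text]) VALUES
    (1, 1, 0, NULL, N'root'),
    (2, 1, 0, 1,    N'child'),
    (2, 2, 0, 1,    N'child, edited'),
    (2, 3, 1, 1,    N'child, edited');   -- node 2 deleted in revision 3

DECLARE @revisionId int = 2;             -- revision to reconstruct

WITH ranked AS (
    -- Number each node's revisions, newest first, up to the target revision.
    SELECT n.UID, n.parentUID, n.[text], n.deleted,
           ROW_NUMBER() OVER (PARTITION BY n.UID
                              ORDER BY n.revisionId DESC) AS rn
    FROM @Node AS n
    WHERE n.revisionId <= @revisionId
)
SELECT UID, parentUID, [text]
FROM ranked
WHERE rn = 1          -- latest revision of each node at that point
  AND deleted = 0     -- drop nodes whose latest revision is a delete
ORDER BY UID;
-- With @revisionId = 2 this returns UID 1 ('root') and UID 2 ('child, edited');
-- with @revisionId = 3 it returns only UID 1.
```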

This works, but it gets ugly fast. I'm using a simplified example here: I actually have about 10 tables like this that need revision tracking, each with cross-references to the others. This approach also breaks foreign keys, since UIDs are not unique. I cannot simply copy the full tree data for each revision, since there may be hundreds of revisions per hour.

What would the best practice be for a problem like this? I'd be willing to use a nonrelational database if it would be clearly superior.

Best Answer

Trying to enforce foreign keys here will just frustrate you, and you don't actually need them.

If you handle versioning with an InsertDateTime column, then you're basically describing a Type 2 dimension as used by many data warehouses. There is quite a lot of material out there about tuning systems that sit on top of data warehouses, but from a T-SQL perspective, consider using APPLY. Like this:

SELECT ...
FROM ... AS t
CROSS APPLY (
    SELECT TOP (1) ...
    FROM ... AS rev
    WHERE rev.UID = t.UID
    AND rev.InsertDateTime < @revDateTime
    ORDER BY rev.InsertDateTime DESC) AS something
WHERE something.IsDeleted = 'N'

This kind of construct benefits from an index on (IsDeleted, UID, InsertDateTime DESC).
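
A minimal sketch of that index, assuming a hypothetical table dbo.NodeRevision; the table name, the columns other than the three indexed ones, and the index name are illustrative, not taken from the answer.

```sql
-- Hypothetical revision table; only IsDeleted, UID and InsertDateTime
-- come from the answer, the rest is assumed for illustration.
CREATE TABLE dbo.NodeRevision (
    Id             int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UID            int           NOT NULL,
    ParentUID      int           NULL,
    NodeText       nvarchar(200) NOT NULL,
    IsDeleted      char(1)       NOT NULL,   -- 'Y' or 'N'
    InsertDateTime datetime2(3)  NOT NULL
);

-- Key columns in the order the answer lists them.
CREATE NONCLUSTERED INDEX IX_NodeRevision_IsDeleted_UID_InsertDateTime
    ON dbo.NodeRevision (IsDeleted, UID, InsertDateTime DESC);
```

Whether IsDeleted should lead the key depends on where the IsDeleted predicate sits in the query. When the filter is applied outside the APPLY, it may be worth comparing execution plans against a variant keyed on (UID, InsertDateTime DESC).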

Alternatively, if you want to look at the whole table as at a particular time, use ROW_NUMBER() like:

WITH numbered AS
 (  SELECT *, ROW_NUMBER() OVER (PARTITION BY UID ORDER BY InsertDateTime DESC) AS rownum
    FROM ...
    WHERE InsertDateTime < @revDateTime
),
VersionAtTime AS
(   SELECT *
    FROM numbered
    WHERE rownum = 1
    AND IsDeleted = 'N'
)
SELECT ...
FROM VersionAtTime
....

(Edited from the original to move the IsDeleted predicates outside the sub-queries, so that a UID whose latest version at that time is a deletion is excluded entirely instead of falling back to an older version.)
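
To make the effect of that predicate placement concrete, here is a small self-contained sketch of the APPLY form. The table variable @NodeRevision, its sample rows, and the NodeText and ParentUID columns are assumptions for illustration only.

```sql
-- Hypothetical sample data: node 2 is created on 1 Jan and deleted on 2 Jan.
DECLARE @NodeRevision TABLE (
    UID            int           NOT NULL,
    ParentUID      int           NULL,
    NodeText       nvarchar(200) NOT NULL,
    IsDeleted      char(1)       NOT NULL,
    InsertDateTime datetime2(3)  NOT NULL
);

INSERT INTO @NodeRevision (UID, ParentUID, NodeText, IsDeleted, InsertDateTime) VALUES
    (1, NULL, N'root',  'N', '2024-01-01T00:00:00'),
    (2, 1,    N'child', 'N', '2024-01-01T00:00:00'),
    (2, 1,    N'child', 'Y', '2024-01-02T00:00:00');

DECLARE @revDateTime datetime2(3) = '2024-01-03T00:00:00';

SELECT t.UID, v.ParentUID, v.NodeText
FROM (SELECT DISTINCT UID FROM @NodeRevision) AS t
CROSS APPLY (
    -- Latest version of this UID strictly before the requested time.
    SELECT TOP (1) rev.ParentUID, rev.NodeText, rev.IsDeleted
    FROM @NodeRevision AS rev
    WHERE rev.UID = t.UID
      AND rev.InsertDateTime < @revDateTime
    ORDER BY rev.InsertDateTime DESC
) AS v
WHERE v.IsDeleted = 'N'   -- applied after TOP (1), so deleted nodes vanish
ORDER BY t.UID;
-- Returns only UID 1. Had the IsDeleted filter been inside the APPLY,
-- TOP (1) would have picked node 2's older, non-deleted row instead.
```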

Related Solutions

Expanding parent-child tree with additional tables

Here's an idea: build your "expansion table" with something like this:

expanded_table (id number, parent_src char(1), parent_id number, ...)

The parent_src column should indicate where the parent comes from, e.g. you could use 'O' for original and 'E' for expansion. (Add a primary key on id, of course.)

To handle the querying, use a view defined like this (pseudo-sql):

select 'O' source, o.id, 'O' parent_src, o.parent_id, [...] from original_table o
union all
select 'E' source, e.id,   e.parent_src, e.parent_id, [...] from expanded_table e

You can do hierarchical queries based on (source,id)/(parent_src,parent_id) pairs.

No problem with duplicate/overlapping ids, the source info disambiguates them.
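
One way to sketch such a hierarchical query is a recursive common table expression that joins on the (source, id)/(parent_src, parent_id) pair. This is an illustration under assumptions, not the answer's own code: the table definitions are reduced to the columns needed, root nodes are assumed to have a NULL parent_id, and the depth column is added only to show the traversal.

```sql
-- Reduced, hypothetical versions of the two tables.
CREATE TABLE original_table (
    id        integer NOT NULL PRIMARY KEY,
    parent_id integer NULL
);

CREATE TABLE expanded_table (
    id         integer NOT NULL PRIMARY KEY,
    parent_src char(1) NOT NULL,   -- 'O' = original, 'E' = expansion
    parent_id  integer NULL
);

WITH combined (source, id, parent_src, parent_id) AS (
    -- Same shape as the view described above.
    SELECT 'O', o.id, 'O', o.parent_id FROM original_table o
    UNION ALL
    SELECT 'E', e.id, e.parent_src, e.parent_id FROM expanded_table e
),
tree (source, id, parent_src, parent_id, depth) AS (
    -- Anchor: roots (assumed to be the rows without a parent).
    SELECT c.source, c.id, c.parent_src, c.parent_id, 0
    FROM combined c
    WHERE c.parent_id IS NULL
    UNION ALL
    -- Recursive step: a child matches its parent on both source and id.
    SELECT c.source, c.id, c.parent_src, c.parent_id, t.depth + 1
    FROM combined c
    JOIN tree t
      ON t.source = c.parent_src
     AND t.id     = c.parent_id
)
SELECT source, id, parent_src, parent_id, depth
FROM tree;
```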

Now data integrity is going to be very problematic.

You can't have check constraints that reference other tables, and I don't know of a way of building a "conditional foreign key" that would work here.
You could use triggers to make sure DML on the expanded table is coherent, but even with those and triggers on the original table (if you could do that, which doesn't seem to be the case), that would be very tricky to get right.

Having all DML to this "structure" go through PL/SQL that you control could do it, but that doesn't appear to be possible given the information in your post.

How to design a database that easily lets me recreate historical snapshots

Versioning. Not a new problem - used thousands of times per day by developers using any version control system.

Mark every item with two fields: ValidFromNo and ValidToNo. The latter can be null.

Keep a versions table that holds the version numbers. ValidFromNo is the version number from which an item is valid; ValidToNo, which can be null, is the last version in which it is valid. Null marks "currently in use", so you do not have to update all valid items on every new version.

It's as simple as that.
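
A minimal T-SQL sketch of a snapshot query under this scheme. The table variable @Item, its ItemId and ItemText columns, the sample rows, and the variable @VersionNo are hypothetical; only ValidFromNo and ValidToNo come from the description above.

```sql
-- Hypothetical item table: item 2 was edited in version 3.
DECLARE @Item TABLE (
    ItemId      int           NOT NULL,
    ItemText    nvarchar(200) NOT NULL,
    ValidFromNo int           NOT NULL,  -- first version in which the row is valid
    ValidToNo   int           NULL       -- last valid version; NULL = still current
);

INSERT INTO @Item (ItemId, ItemText, ValidFromNo, ValidToNo) VALUES
    (1, N'root',          1, NULL),
    (2, N'child',         1, 2),
    (2, N'child, edited', 3, NULL);

DECLARE @VersionNo int = 2;              -- snapshot to recreate

SELECT ItemId, ItemText
FROM @Item
WHERE ValidFromNo <= @VersionNo
  AND (ValidToNo IS NULL OR ValidToNo >= @VersionNo)
ORDER BY ItemId;
-- Version 2 returns 'root' and 'child'; version 3 returns 'root' and 'child, edited'.
```

Under this sketch, an edit in a new version sets ValidToNo on the row being replaced and inserts a new row with ValidToNo left null, so unchanged rows are never touched.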
