Database Design – Tracking History Without Losing Referential Integrity

database-designreferential-integrity

To make it simple, let’s say we have these 2 tables:

CREATE TABLE `tbl_companies` (
  `id` int(11) NOT NULL,
  `name` varchar(100) NOT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `tbl_workers` (
  `id` int(11) NOT NULL,
  `name` int(11) NOT NULL,
  PRIMARY KEY (`id`)
);

And another 2 tables to store the relationship between companies and workers

CREATE TABLE `tbl_companies_workers` (
  `company_id` int(11) NOT NULL,
  `worker_id` int(11) NOT NULL,
  PRIMARY KEY (`company_id `, `worker_id `)
);

CREATE TABLE `tbl_companies_workers_history` (
  `company_id` int(11) NOT NULL,
  `worker_id` int(11) NOT NULL,
  `start` date NOT NULL,
  `end` date DEFAULT NULL,
  PRIMARY KEY (`company_id `, `worker_id `)
)

With this schema it’s easy to display something like: “John Smith worked at MyComp from 2010 to 2012”. But if MyComp is deleted from de DB, then that information is deleted too. I’d like to be able to display the same info even though the company is deleted.

Is it possible to achieve this without losing referential integrity? What would be the best approach? I've been looking for some info and I found this and this, but still it’s not clear to me.

Thanks in advance.

Best Answer

This is a long-standing problem with lots of awkward solutions like the ones you referenced. The best ones still leave the problem of referential integrity. How do you reference an entity that changes over time and you want to maintain all the versions?

The solution actually turns out to be very simple: normalization. When we normalize an entity, there is a master entity table. This contains the PK of the entity and all other attributes. All FKs from other tables refer to this table as there is one and only one entry for any given entity.

After normalization, this table still exists but there is (at least) one other table that contains some of the attributes. There could be zero, one or several entries for any unique entity but that is fine. In fact, the PK of this table is the PK of the entity (which is also a FK back to the master entity table) and some other attribute.

So now you implement Version Normal Form. Take all the attributes of the main entity and separate out those attributes that change over time and you want to track those changes. What you have is the master entity table with the one and only PK of the entity (referential integrity works as normal) and all attributes that either don't change over time or the changes aren't tracked. Then the "version" table contains all the attributes that change and are tracked. The PK of this table is the entity PK and the change date (the date and time the version took affect).

A partial list of advantages of this scheme are

  • It uses normalization, a process we all know and work with daily.
  • Referential integrity is not affected.
  • Current and past versions are in the same table. In fact, the query to access the current state of the data and the query to look back into the past is the same query.
  • The change date can be set to the future for those instances when a state change is known ahead of time. Such planned changes do not appear in queries until the change time is reached.

There are two methods of "soft delete" that can be used. A "deleted" datetime attribute can be added to the master record with the default being NULL or (as I prefer) the magic date 9999-12-31. This is what I call a "firm" delete because it should not be undo-able. The other method is to add an "isDeleted" attribute to the version record. A delete operation would insert a version with "isDeleted" set to TRUE (or 'Y' or whatever). The deletion would be just another version. So to undelete, a new version with "isDeleted" turned off would bring the entity back into existence -- with the times it was deleted and then undeleted maintained like any other state change. Of course, both methods may be used if needed.

Here are the slides of a presentation I have made a few times at tech fairs. It covers how all the above can be done including the queries. And here is a document that goes into a lot more detail.