Design of portable tables with validity interval (historization, temporal databases)

database-design | index | oracle | postgresql | sql-server

I'm designing a data model for an application which must keep track of changes to its data.

In a first step, my application must support PostgreSQL, but I'd like to add support for other RDBMSs (especially Oracle and MS SQL Server) in a second step. Therefore, I'd like to choose a portable data model that makes as little use of proprietary features as possible. (The DDL for the tables may differ from vendor to vendor, but the SQL queries / statements in the application should be the same for all supported vendors, as far as possible.)

For example, let's say there is a users table and a users_versions table. users_versions has a foreign key referencing users.

An example of the tables could look like:

users
----------------
id | username
---------------- 
 1 | johndoe
 2 | sally

users_versions --> references id of user (userid)
---------------------------------------------------------------------------
id | userid | name     | street      | place     | validfrom  | validuntil
---------------------------------------------------------------------------
 1 |      1 | John Doe | 2nd Fake St | Faketown  | 2018-01-04 | 2018-01-05
 2 |      1 | John Doe | Real St 23  | Faketown  | 2018-01-05 | null
 3 |      2 | Sally Wu | Main St 1   | Lake Fake | 2018-04-02 | 2018-04-20
 4 |      2 | Sally Wu | Other St 99 | Chicago   | 2018-04-20 | null

Most SQL queries will ask for the currently valid entries. In the concept example above, this would look like:

SELECT *
  FROM users_versions uv 
  INNER JOIN users u ON u.id = uv.userid
  WHERE uv.userid = 123 AND uv.validuntil IS NULL;

Some use cases (reporting etc.) will require SELECTing a historic version of the data as well (e.g. which data were valid on 2017-12-31?). But these won't be performance-critical in my application.
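
For illustration, such a point-in-time query could look like the sketch below. It assumes half-open intervals (validfrom inclusive, validuntil exclusive, NULL meaning "still valid"); the DATE literal syntax would need adjusting for SQL Server:

-- Which data were valid on 2017-12-31?
SELECT *
  FROM users_versions uv
  INNER JOIN users u ON u.id = uv.userid
  WHERE uv.validfrom <= DATE '2017-12-31'
    AND ( uv.validuntil > DATE '2017-12-31' OR uv.validuntil IS NULL );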

In the example above, I might create a unique index filtered on validuntil to ensure that there is only one entry with unlimited validity per user at a time:

CREATE UNIQUE INDEX foo
  ON users_versions ( userid ) 
  WHERE validuntil IS NULL;

As far as I know, filtered (partial) indexes are only available in PostgreSQL and MS SQL Server, not in Oracle. Moreover, indexing NULL can be a tricky thing as well, since the behavior differs per RDBMS (possible / only in multi-column indexes / not possible).

Therefore, a different approach for users_versions might be the structure above plus an explicit valid column managed by the application. The most recent entry would get a 1, all historic entries a 0. Then I could create two indexes: one for query optimization and one for integrity enforcement (only one valid entry at a time):

CREATE INDEX optimization
  ON users_versions ( userid, valid );

For queries like:

SELECT *
  FROM users_versions uv 
  INNER JOIN users u ON u.id = uv.userid
  WHERE uv.userid = 123 AND uv.valid = 1;

And one more index to enforce current-version integrity (here in its Oracle variant):

-- ORACLE: Entry with null-only columns ignored in indexing:
CREATE UNIQUE INDEX only_one_valid_version_per_user
  ON users_versions ( 
    CASE WHEN valid = 1 THEN userid ELSE null END,
    CASE WHEN valid = 1 THEN valid  ELSE null END
  );

This index probably cannot be used for query optimization, but it should ensure that there can be only one valid entry per userid while allowing an unlimited number of invalid entries (valid = 0) for the same userid.

What's your suggestion for a portable design of such history tables that still performs well in everyday usage?

  • validfrom + validuntil, with validuntil (nullable) set to null in the currently valid entry
  • validfrom + validuntil, with validuntil (not nullable) set to a far-future date like 2999-12-31 in the currently valid entry (see the sketch after this list)
  • validfrom + validuntil + valid flag, with valid flag managed by the application and used in queries for the currently valid entry
  • …?
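
To make the second option concrete: with a far-future sentinel date, validuntil can be declared NOT NULL, and a plain composite unique index works unchanged on PostgreSQL, Oracle and SQL Server. A minimal sketch (column names and sentinel value are assumptions):

-- validuntil is never NULL (sentinel 2999-12-31 marks "currently valid"),
-- so no filtered or function-based index is needed:
CREATE UNIQUE INDEX one_current_version_per_user
  ON users_versions ( userid, validuntil );

A side effect of this index is that two historic versions of the same user cannot end on the same date, which may or may not be acceptable at date granularity.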

When INSERTing new versions, my application will always perform two steps (a combined sketch follows the list):

  • Invalidate current version (set validuntil to current date (plus, optionally, set valid flag to 0))
  • Insert new version (validfrom current date, plus, optionally, with valid flag 1)
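
A minimal sketch of those two steps as a single transaction, using the valid-flag variant (values are illustrative; the surrogate id is assumed to be auto-generated, and CURRENT_DATE / BEGIN would need dialect-specific replacements on SQL Server):

BEGIN;

-- Step 1: invalidate the current version:
UPDATE users_versions
  SET validuntil = CURRENT_DATE, valid = 0
  WHERE userid = 123 AND valid = 1;

-- Step 2: insert the new version:
INSERT INTO users_versions ( userid, name, street, place, validfrom, validuntil, valid )
VALUES ( 123, 'John Doe', 'New St 5', 'Faketown', CURRENT_DATE, NULL, 1 );

COMMIT;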

I don't have the requirement that the database enforces overlap-free time intervals for historic entries. I only must make sure that there is only 1 entry with unlimited validity.

For some very large tables, it might be worth splitting into a current and a history table: one table only contains the currently valid versions (users_versions_current), the other contains all the historic versions (users_versions_history). Whenever a new version is inserted, the previous version is inserted with validfrom/validuntil into the ..._history table.

What aspects should I consider?
Do you know literature, best practice recommendations etc.?

Best Answer

I must say that I agree with the spirit of other answers, and I think that you should first focus on building an optimal database with one specific database management system (DBMS) in mind; the portability aspect, although important, should be secondary.


According to the content of your question, you appear to be very familiar with the subject. Anyway, I have shared my take on two scenarios involving temporal capabilities in this post and also in this other post (containing sample diagrams, expository DDL code, etc.), in case you want to take a look and establish some analogies.

Conceptual examination

Starting the analysis at the conceptual level, the business rules under consideration can be formulated as follows:

  • There can be one-to-many Users
  • A User holds exactly-one CurrentVersion
  • A User holds zero-or-many PastVersions

As demonstrated, the entity types CurrentVersion and PastVersion are involved in a one-to-zero-or-many (or zero-or-many-to-one) association. Apart from the cardinalities, it can be inferred that we are dealing with two distinct entity types because, in this case, a CurrentVersion instance does not have a ValidUntil property, while all instances of PastVersion must have it.

Logical-level arrangement

So, I suggest (a) one base table for the “current version” rows and (b) one base table for the “past version” ones. In this way, the assertions (i.e., rows) retained in each table represent what is a clearly different —although associated— kind of fact (as per the relational model theory), avoiding the ad hoc introduction of ambiguities in a single table.

Considering the user example you brought up —and in agreement with the conceptual definitions above—, the structure of the two tables would be almost the same, but user_version (i.e., the one for the “past” versions) includes an additional valid_until column, which along with user_id must make up the composite PRIMARY KEY of said table. The user_version.user_id column must be constrained as a FOREIGN KEY referencing user.user_id.
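
An expository DDL sketch of that layout (PostgreSQL syntax; the data types and non-key columns are assumptions derived from the question's example):

CREATE TABLE "user" (
  user_id    integer     NOT NULL,
  username   varchar(30) NOT NULL,
  name       varchar(60) NOT NULL,
  street     varchar(60) NOT NULL,
  place      varchar(60) NOT NULL,
  valid_from date        NOT NULL,
  CONSTRAINT user_pk PRIMARY KEY ( user_id ),
  CONSTRAINT user_username_ak UNIQUE ( username )
);

CREATE TABLE user_version (
  user_id     integer     NOT NULL,
  name        varchar(60) NOT NULL,
  street      varchar(60) NOT NULL,
  place       varchar(60) NOT NULL,
  valid_from  date        NOT NULL,
  valid_until date        NOT NULL,
  CONSTRAINT user_version_pk PRIMARY KEY ( user_id, valid_until ),
  CONSTRAINT user_version_to_user_fk
    FOREIGN KEY ( user_id ) REFERENCES "user" ( user_id )
);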

Manipulation

When a more “up-to-date” version has to be “saved”, the whole row of the “previous” version undergoes an INSERT operation INTO the user_version table, attaching the corresponding valid_until value that indicates the exact instant when the operation is carried out. In turn, the values of the “preceding” row in the user (i.e., “current”) table are replaced with the “most recent” ones by means of an UPDATE.
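
Expressed as a sketch against the expository tables above (PostgreSQL syntax, illustrative values):

BEGIN;

-- 1. Archive the soon-to-be "past" row, stamping the end of its validity:
INSERT INTO user_version ( user_id, name, street, place, valid_from, valid_until )
SELECT user_id, name, street, place, valid_from, CURRENT_DATE
  FROM "user"
  WHERE user_id = 1;

-- 2. Replace the "current" values in place:
UPDATE "user"
  SET name = 'John Doe', street = 'Real St 23', place = 'Faketown',
      valid_from = CURRENT_DATE
  WHERE user_id = 1;

COMMIT;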

Each row in the user table would cover the need for unlimited validity that you have to ensure (not having a valid_until column, its values remain valid up to the moment when they are UPDATEd, which may never happen).
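
Accordingly, the frequent “currently valid” query needs no validity predicate at all, e.g.:

-- The current data simply is the row in "user":
SELECT *
  FROM "user"
  WHERE user_id = 1;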

Integrity

Of course, the sequentiality of the associated values has to be taken care of (e.g., preventing overlaps, rejecting invalid dates, etc.), just like overall integrity. I would make use of ACID Transactions to guarantee that the pertinent operations are treated as a single Unit of Work within the DBMS itself. Stored procedures (or functions in Postgres) with appropriate permissions would as well be very helpful.
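
For instance, a minimal PostgreSQL function wrapping the unit of work sketched above (the name and parameters are illustrative assumptions; Oracle and SQL Server would use their own stored procedure dialects):

CREATE FUNCTION save_user_version (
  p_user_id integer,
  p_name    varchar,
  p_street  varchar,
  p_place   varchar
) RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
  -- Both statements run atomically within the caller's transaction:
  INSERT INTO user_version ( user_id, name, street, place, valid_from, valid_until )
  SELECT user_id, name, street, place, valid_from, CURRENT_DATE
    FROM "user"
    WHERE user_id = p_user_id;

  UPDATE "user"
    SET name = p_name, street = p_street, place = p_place,
        valid_from = CURRENT_DATE
    WHERE user_id = p_user_id;
END;
$$;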

There is no need for NULLable columns (a table with NULL marks does not portray a mathematical relation, so one cannot expect it to behave as such, to be normalizable, etc.), nor for a valid column managed by one (or more) application program(s), which would endanger data quality by violating the principle of self-protection of a database.

Derivability

The period between the values in user_version.valid_from and user_version.valid_until stands for the entire validity_interval during which a certain “past” row was “current” or “effective” (it can be calculated in days, minutes, etc., and may be incorporated into a view or computed in application program [app] code as convenient). This and other relevant aspects imply deriving data by virtue of data manipulation operations, mostly SELECTs and a few subqueries.
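
As a sketch, such a derivation could be packaged into a view (the view and column names are assumptions; SQL Server would use DATEDIFF instead of plain date subtraction):

-- Validity interval length in days, derived on the fly:
CREATE VIEW user_version_with_interval AS
SELECT user_id, name, street, place, valid_from, valid_until,
       valid_until - valid_from AS validity_days
  FROM user_version;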

Accessing the database from the external level by way of one or more apps

Constructing, let us say, an object-oriented programming “intermediate tier” consumed, in turn, by a “higher tier” of one or more apps (or another kind of software component) would also help with database portability, allowing code reuse and considerable isolation from migrations to other DBMSs. This resource about the Repository Pattern in .NET (C#) can bring about some ideas in this respect.

Portability considerations

It is important to draw a distinction between two of the different levels of abstraction of a database built on a SQL DBMS. The (1) structure and (2) constraints of the tables, along with (3) the data manipulation operations (INSERT, SELECT, UPDATE, DELETE, combinations thereof) effectuated on the tables, are elements of the logical level. The (4) underlying indexes supporting a table and/or constraints are (“lower”) physical-level components.

In this manner, the same logical design principle is applicable in all the major SQL platforms. But, as you mention in the question, the differences between the tools provided by the various SQL DBMSs for creating logical elements are mostly syntactic, so portability would be affected by some dialect-specific SQL (DDL and DML) features (and perhaps by DBMS-specific data type characteristics and names as well). Thus, the convenient approach is to write SQL code that complies with the ISO/IEC/ANSI standard syntax whenever feasible.

Another problem you will face is that the same (logical) query will be executed differently (at the physical level) depending on the particular DBMS in use, so response times will vary greatly and, hence, you will have to make some rewrites to improve speed.

With regard to the physical-level mechanisms: yes, each of the SQL platforms offers different kinds of indexes, but you should make sure that the platform-specific indexing settings do not affect the logical-level layout of a database, be it (i) when creating/changing indexes on the same DBMS or (ii) when porting a certain database to another DBMS (this point is related to the subject known as physical data independence).

In this respect, it would be very handy to have a powerful Storage Definition Language (SDL) completely independent from the Data Definition Language (DDL), which would facilitate a clear separation of concerns between the logical and physical tiers, but that is a different story. So, you should try to separate the code of the logical declarations from the code of the physical settings as much as possible to assist portability (hard to accomplish with the current DDL mixture of characteristics, I know).

Speed

Furthermore, it is at the physical level of abstraction where you should optimize the performance of a database (via single- or multi-column indexes, upgrading network bandwidth, improving operating system and/or DBMS and/or hardware configurations, etc.), without damaging the quality of the logical structure and constraints and, therefore, without putting (1) the integrity of the data and (2) the reliability of result sets at risk. Logical coherence is a paramount factor in general performance, and a piece of software providing incoherent information can hardly be deemed a database, no matter how fast it “works”. Database reliability and speed go, decidedly, hand in hand.

Observation

As for the expository data provided in your question, it looks like the users_versions.id column is excessive, since it appears to be an extra column meant to retain system-controlled surrogate keys (e.g., a column with an IDENTITY property in a SQL Server table), making that table logically wider than necessary, which implies “heavier” structures (in terms of bytes) at the physical level (e.g., a supplementary index), slowing down the execution of data manipulation operations.

In addition, since a surrogate key value is meaningless, its enclosing column would unlikely be specified as a condition in the WHERE clauses of SELECT operations (in contrast, most of the queries will probably include users_versions.user_id and/or valid_from and/or valid_until, which together make up the “natural” PRIMARY KEY), so users_versions.id would add practically no benefit at all; it would in fact be a burden demanding needless management. In light of all of the above, I consider this another factor that you should take into account to optimize overall system functioning and administration.

My more detailed take on columns for system-controlled surrogate keys is contained in this answer and in this answer too, in case you are interested.

Enabling temporal capabilities exclusively for one column

There are situations where you have to enable temporal capabilities for only one column, so this post and this post may serve as references too.