I'm designing a data model for an application which must keep track of changes to its data.
In a first step, my application must support PostgreSQL, but I'd like to add support for other RDBMS (especially Oracle and MS SQL Server) in a second step. Therefore, I'd like to choose a portable data model with minimal use of proprietary features. (The DDL for the tables may differ from RDBMS vendor to RDBMS vendor. But the SQL queries / statements in the application should be the same for all supported vendors, as far as possible.)
For example, let's say there is a users table and a users_versions table. users_versions has a foreign key on users.
An example of the tables could look like:
users
----------------
id | username
----------------
1 | johndoe
2 | sally
users_versions --> references id of user (userid)
---------------------------------------------------------------------------
id | userid | name | street | place | validfrom | validuntil
---------------------------------------------------------------------------
1 | 1 | John Doe | 2nd Fake St | Faketown | 2018-01-04 | 2018-01-05
2 | 1 | John Doe | Real St 23 | Faketown | 2018-01-05 | null
3 | 2 | Sally Wu | Main St 1 | Lake Fake | 2018-04-02 | 2018-04-20
4 | 2 | Sally Wu | Other St 99 | Chicago | 2018-04-20 | null
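For concreteness, the two tables above could be declared roughly as follows (PostgreSQL syntax; the column types and lengths are assumptions, since the question only shows sample data):

```sql
-- Sketch only: types and lengths are assumptions based on the sample data.
CREATE TABLE users (
    id       integer      NOT NULL PRIMARY KEY,
    username varchar(100) NOT NULL UNIQUE
);

CREATE TABLE users_versions (
    id         integer      NOT NULL PRIMARY KEY,
    userid     integer      NOT NULL REFERENCES users (id),
    name       varchar(100) NOT NULL,
    street     varchar(100) NOT NULL,
    place      varchar(100) NOT NULL,
    validfrom  date         NOT NULL,
    validuntil date         NULL      -- null = currently valid version
);
```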
Most SQL queries will query for the currently valid entries. In the concept example above, this would look like:
SELECT *
FROM users_versions uv
INNER JOIN users u ON u.id = uv.userid
WHERE uv.userid = 123 AND uv.validuntil IS NULL;
Some use cases (reporting etc.) will require SELECTing a historic version of the data as well (e.g. which data were valid at 2017-12-31?). But these won't be performance-critical in my application.
In the example above, I might create a filtered unique index (filtered on validuntil) to ensure that there is only one entry with unlimited validity at a time:
CREATE UNIQUE INDEX foo
ON users_versions ( userid )
WHERE validuntil IS NULL;
As far as I know, filtered indexes can be used for query optimization in PostgreSQL and MS SQL Server, but not in Oracle. Moreover, indexing null values can be tricky as well (depending on the RDBMS: possible, possible only in multi-column indexes, or not possible at all).
Therefore, a different approach for users_versions might be the structure above plus an explicit valid column managed by the application. The most recent entry would get a 1, all historic entries would get a 0. Then I could create two indexes, one for query optimization and one for integrity enforcement (only one valid entry at a time):
CREATE INDEX optimization
ON users_versions ( userid, valid );
For queries like:
SELECT *
FROM users_versions uv
INNER JOIN users u ON u.id = uv.userid
WHERE uv.userid = 123 AND uv.valid = 1;
And one more index to enforce the current version integrity (e.g. ORACLE version):
-- ORACLE: Entry with null-only columns ignored in indexing:
CREATE UNIQUE INDEX only_one_valid_version_per_user
ON users_versions (
CASE WHEN valid = 1 THEN userid ELSE null END,
CASE WHEN valid = 1 THEN valid ELSE null END
);
Probably this index cannot be used for query optimization, but it should ensure that there can be only one valid entry per userid, while allowing an unlimited number of invalid entries (valid = 0) for the same userid.
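For comparison, MS SQL Server supports filtered indexes directly, so the CASE-expression workaround should only be needed for Oracle. A sketch of the SQL Server equivalent:

```sql
-- SQL Server: filtered unique index, analogous to the PostgreSQL partial index
CREATE UNIQUE INDEX only_one_valid_version_per_user
    ON users_versions (userid)
    WHERE valid = 1;
```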
What's your suggestion for a portable design of such history tables which allows performant usage?

- validfrom + validuntil, with validuntil (nullable) set to null in the currently valid entry
- validfrom + validuntil, with validuntil (not nullable) set to a far-future date like 2999-12-31 in the currently valid entry
- validfrom + validuntil + valid flag, with the valid flag managed by the application and used in queries for the currently valid entry
- …?
When INSERTing new versions, my application will always perform two steps:

- Invalidate the current version (set validuntil to the current date; plus, optionally, set the valid flag to 0)
- Insert the new version (validfrom = current date; plus, optionally, with valid flag 1)
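Wrapped in a single transaction, so no reader ever sees zero or two current rows, the two steps might look like this (a PostgreSQL-flavored sketch; the literal values are placeholders and the id column is assumed to be auto-generated):

```sql
BEGIN;

-- Step 1: invalidate the current version
UPDATE users_versions
   SET validuntil = CURRENT_DATE,
       valid      = 0            -- only if the valid-flag variant is used
 WHERE userid = 123
   AND validuntil IS NULL;

-- Step 2: insert the new version
INSERT INTO users_versions (userid, name, street, place, validfrom, validuntil, valid)
VALUES (123, 'John Doe', 'New St 42', 'Faketown', CURRENT_DATE, NULL, 1);

COMMIT;
```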
I don't have the requirement that the database enforces overlap-free time intervals for historic entries. I only must make sure that there is only 1 entry with unlimited validity.
For some very large tables, it might be worth splitting into a current and a history table: one table only contains the currently valid versions (users_versions_current), the other contains all the historic versions (users_versions_history). Whenever a new version is inserted, the previous version is inserted with validfrom/validuntil into the ..._history table.
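With such a split, publishing a new version means moving the outgoing current row into the history table and replacing it in place, e.g. (sketch; table and column names are assumptions following the naming above):

```sql
BEGIN;

-- Archive the outgoing version with a closed validity interval
INSERT INTO users_versions_history (userid, name, street, place, validfrom, validuntil)
SELECT userid, name, street, place, validfrom, CURRENT_DATE
  FROM users_versions_current
 WHERE userid = 123;

-- Replace the current version in place
UPDATE users_versions_current
   SET name = 'John Doe', street = 'New St 42', place = 'Faketown',
       validfrom = CURRENT_DATE
 WHERE userid = 123;

COMMIT;
```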
What aspects should I consider?
Do you know literature, best practice recommendations etc.?
Best Answer
I must say that I agree with the spirit of other answers, and I think that you should first focus on building an optimal database with one specific database management system (DBMS) in mind; the portability aspect, although important, should be secondary.
According to the content of your question, you appear to be very familiar with the subject. Anyway, I have shared my take on two scenarios involving temporal capabilities in this post and also in this other post (containing sample diagrams, expository DDL code, etc.), in case you want to take a look and establish some analogies.
Conceptual examination
Starting the analysis at the conceptual level, the business rules under consideration can be formulated as follows:
As demonstrated, the entity types CurrentVersion and PastVersion are involved in a one-to-zero-or-many (or zero-or-many-to-one) association. Apart from the cardinalities, it can be inferred that we are dealing with two distinct entity types because, in this case, a CurrentVersion instance does not have a ValidUntil property, while all instances of PastVersion must have it.
Logical-level arrangement
So, I suggest (a) one base table for the “current version” rows and (b) one base table for the “past version” ones. In this way, the assertions (i.e., rows) retained in each table represent what is a clearly different —although associated— kind of fact (as per the relational model theory), avoiding the ad hoc introduction of ambiguities in a single table.
Considering the user example you brought up, and in agreement with the conceptual definitions above, the structure of the two tables would be almost the same, but user_version (i.e., the one for the "past" versions) includes an additional valid_until column, which along with user_id must make up the composite PRIMARY KEY of said table. The user_version.user_id column must be constrained as a FOREIGN KEY referencing user.user_id.

Manipulation
When a more up-to-date version has to be saved, the whole row of the previous version undergoes an INSERT operation INTO the user_version table, attaching the corresponding valid_until value indicating the exact instant when the operation is carried out. In turn, the values of the preceding row in the user (i.e., "current") table are replaced with the most recent ones by means of an UPDATE.

Each row in the user table would cover the need for unlimited validity that you have to ensure (not having a valid_until column, the values remain valid up to the moment when they are UPDATEd, which may never arrive).

Integrity
Of course, the sequentiality of the associated values has to be taken care of (e.g., preventing overlaps, rejecting invalid dates, etc.), just like overall integrity. I would make use of ACID Transactions to guarantee that the pertinent operations are treated as a single Unit of Work within the DBMS itself. Stored procedures (or functions in Postgres) with appropriate permissions would as well be very helpful.
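A minimal sketch of the two-table arrangement and of the versioning transaction described above (PostgreSQL syntax; the columns beyond user_id, valid_from and valid_until are assumptions for illustration):

```sql
-- "user" is a reserved word in standard SQL, hence the quoting.
CREATE TABLE "user" (
    user_id    integer      NOT NULL PRIMARY KEY,
    username   varchar(100) NOT NULL UNIQUE,
    name       varchar(100) NOT NULL,
    valid_from timestamp    NOT NULL
);

CREATE TABLE user_version (
    user_id     integer      NOT NULL REFERENCES "user" (user_id),
    valid_until timestamp    NOT NULL,
    name        varchar(100) NOT NULL,
    valid_from  timestamp    NOT NULL,
    PRIMARY KEY (user_id, valid_until)
);

-- Saving a new version as a single unit of work:
BEGIN;

INSERT INTO user_version (user_id, valid_until, name, valid_from)
SELECT user_id, CURRENT_TIMESTAMP, name, valid_from
  FROM "user"
 WHERE user_id = 123;

UPDATE "user"
   SET name = 'John Doe', valid_from = CURRENT_TIMESTAMP
 WHERE user_id = 123;

COMMIT;
```

In production this would preferably live inside a stored procedure (or a Postgres function) with appropriate permissions, as the answer suggests.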
There is no need for NULLable columns (a table with NULL marks does not portray a mathematical relation, so one cannot expect that it behaves as such, that it can be normalized, etc.), nor for a valid column managed by one or more application programs, which would endanger data quality by violating the principle of self-protection of a database.

Derivability
The period comprehended between the values in user_version.valid_from and user_version.valid_until stands for the entire validity interval during which a certain "past" row was "current" or "effective" (it can be calculated in days, minutes, etc., and may be incorporated into a view or computed in application program [app] code as convenient). This and other relevant aspects imply deriving data by virtue of data manipulation operations, mostly SELECTs and a few subqueries.

Accessing the database from the external level by way of one or more apps
Constructing, let us say, an object-oriented programming “intermediate tier” consumed, in turn, by a “higher tier” of one or more apps (or another kind of software component) would as well help in database portability, allowing code reuse and considerable isolation from migrations to other DBMSs. This resource about the Repository Pattern in .NET (C#) can bring about some ideas in this respect.
Portability considerations
It is important to draw a distinction between two of the different levels of abstraction of a database built on a SQL DBMS. The (1) structure and (2) constraints of the tables along with (3) the data manipulation operations —INSERT, SELECT, UPDATE, DELETE, combinations thereof— effectuated on the tables are elements of the logical level. The (4) underlying indexes supporting a table and/or constraints are (“lower“) physical-level components.
In this manner, the same logical design principle is applicable in all the major SQL platforms but, as you mention in the question, the differences between the tools provided by the various SQL DBMSs for creating logical elements are mostly syntactic, so the portability would be affected by some dialect-specific SQL (DDL and DML) features (and perhaps by DBMS-specific data type characteristics and names as well), thus the convenient approach would be to write SQL code that complies with the ISO/IEC/ANSI standard syntax whenever feasible.
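To illustrate the dialect differences, even a simple row-limiting query diverges across the three target platforms, while the ISO/IEC 9075 form is increasingly supported (a sketch; version support noted per my understanding):

```sql
-- ISO standard form (SQL:2008), supported by PostgreSQL and Oracle 12c+:
SELECT username FROM users ORDER BY username FETCH FIRST 10 ROWS ONLY;

-- SQL Server 2012+ requires the OFFSET clause together with FETCH:
SELECT username FROM users ORDER BY username OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;

-- Common proprietary equivalents:
SELECT username FROM users ORDER BY username LIMIT 10;  -- PostgreSQL
SELECT TOP 10 username FROM users ORDER BY username;    -- SQL Server
```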
Another problem you will face is that the same (logical) query may be executed differently (at the physical level) depending on the particular DBMS in use, so the response time can vary greatly and, hence, you may have to make some rewrites to improve speed.
With regard to the physical-level mechanisms: yes, each of the SQL platforms offers different kinds of indexes, but you should make sure that the platform-specific indexing settings do not affect the logical-level layout of the database, be it (i) when creating/changing indexes on the same DBMS or (ii) when porting a certain database to another DBMS (this point is related to the subject known as physical data independence).
In this respect, it would be very handy to have a powerful Storage Definition Language (SDL) completely independent from the Data Definition Language (DDL), which would facilitate a clear separation of concerns between the logical and physical tiers; but that is a different story. In the meantime, you should try to separate the code of the logical declarations from the code of the physical settings as much as possible to assist portability (hard to accomplish with the current DDL mixture of characteristics, I know).
Speed
Furthermore, it is at the physical level of abstraction where you should optimize the performance of a database (via single- or multi-column indexes, upgrading network bandwidth, improving operating system and/or DBMS and/or hardware configurations, etc.), without damaging the quality of the logical structure and constraints and, therefore, putting (1) the integrity of the data and (2) result set reliability at risk. The logical coherence is a paramount factor in general performance, and a piece of software providing incoherent information can hardly be deemed a database, it does not matter if it “works” particularly fast. Database reliability and speed go, decidedly, hand in hand.
Observation
As for the expository data provided in your question, it looks like the users_versions.id column is excessive, since it appears to be an extra column meant to retain system-controlled surrogate keys (e.g., a column with an IDENTITY property in a SQL Server table), making that table logically wider than necessary, which implies "heavier" structures (in terms of bytes) at the physical level (e.g., a supplementary index), slowing down the execution of data manipulation operations.

In addition, since a surrogate key value is meaningless, its enclosing column would be unlikely to be specified as a condition in the WHERE clauses of SELECT operations (in contrast, most of the queries will probably include users_versions.user_id and/or valid_from and/or valid_until, which comprise the "natural" PRIMARY KEY), so users_versions.id would add practically no benefit at all; it would in fact be a burden demanding needless management. In light of all of the above, I consider this another factor that you should take into account to optimize overall system functioning and administration.

My more detailed take on columns for system-controlled surrogate keys is contained in this answer and in this answer too, in case you are interested.
Enabling temporal capabilities exclusively for one column
There are situations where you have to enable temporal capabilities for only one column, so this post and this post may serve as references too.