MySQL Versioning System – Implementing a Versioning System with MySQL

database-designMySQL

I know this has been asked here and here, but I have the same idea with a different possible implementation and I need some help.

Initially I had my blogstories table with this structure:

| Column    | Type        | Description                                    |
|-----------|-------------|------------------------------------------------|
| uid       | varchar(15) | 15 characters unique generated id              |
| title     | varchar(60) | story title                                    |
| content   | longtext    | story content                                  |
| author    | varchar(10) | id of the user that originally wrote the story |
| timestamp | int         | integer generated with microtime()             |

After I decided I wanted to implement some versioning system for every story on the blog, the first thing that came to my mind was creating a different table to hold edits; after that, I thought I could modify the existing table to hold versions instead of edits. This is the structure that came to my mind:

| Column        | Type          | Description                                       |
|------------   |-------------  |------------------------------------------------   |
| story_id      | varchar(15)   | 15 characters unique generated id                 |
| version_id    | varchar(5)    | 5 characters unique generated id                  |
| editor_id     | varchar(10)   | id of the user that commited                      |
| author_id     | varchar(10)   | id of the user that originally wrote the story    |
| timestamp     | int           | integer generated with microtime()                |
| title         | varchar(60)   | current story title                               |
| content       | longtext      | current story text                                |
| coverimg      | varchar(20)   | cover image name                                  |

The reasons why I came here:

The uid field of the initial table was UNIQUE in the table. Now, the story_id is not unique anymore. How should I deal with that? (I thought I could address story_id = x and then find the latest version, but that seems very resource consuming, so please give your advice)
author_id field value is repeating in each every row of the table. Where and how should I keep it?

Edit

The unique codes generation process is in the CreateUniqueCode function:

trait UIDFactory {
  public function CryptoRand(int $min, int $max): int {
    $range = $max - $min;
    if ($range < 1) return $min;
    $log = ceil(log($range, 2));
    $bytes = (int) ($log / 8) + 1;
    $bits = (int) $log + 1;
    $filter = (int) (1 << $bits) - 1;
    do {
        $rnd = hexdec(bin2hex(openssl_random_pseudo_bytes($bytes)));
        $rnd = $rnd & $filter;
    } while ($rnd >= $range);
    return $min + $rnd;
  }
  public function CreateUID(int $length): string {
    $token = "";
    $codeAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    $codeAlphabet.= "abcdefghijklmnopqrstuvwxyz";
    $codeAlphabet.= "0123456789";
    $max = strlen($codeAlphabet) - 1;
    for ($i=0; $i < $length; $i++) {
        $token .= $codeAlphabet[$this->CryptoRand(0, $max)];
    }
    return $token;
  }
}

The code is written in Hack, and was originally written in PHP by @Scott in his answer.

The fields author_id and editor_id can be different, because there are users with enough permissions to edit anyone's stories.

Best Answer

Analyzing the scenario —which presents characteristics associated with the subject known as temporal databases— from a conceptual perspective, one can determine that: (a) a “present” Blog Story Version and (b) a “past” Blog Story Version, although very resembling, are entities of different types.

In addition to that, when working at the logical level of abstraction, facts (represented by rows) of distinct kinds must be retained in distinct tables. In the case under consideration, even when quite similar, (i) facts about “present” Versions are different from (ii) facts about “past” Versions.

Therefore I recommend managing the situation by means of two tables:

one dedicated exclusively for the “current” or “present” Versions of the Blog Stories, and
one that is separate, but also linked with the other, for all the “previous” or “past” Versions;

each with (1) a slightly distinct number of columns and (2) a different group of constraints.

Back to the conceptual layer, I consider that —in your business environment— Author and Editor are notions that can be delineated as Roles that can be played by a User, and these important aspects depend on data derivation (via logical-level manipulation operations) and interpretation (carried out by the Blog Stories readers and writers, at the external level of the computerized information system, with the assistance of one or more application programs).

I will detail all these factors and other relevant points as follows.

Business rules

According to my understanding of your requirements, the following business rules formulations (put together in terms of the relevant entity types and their kinds of interrelationships) are specially helpful in establishing the corresponding conceptual schema:

A User writes zero-one-or-many BlogStories
A BlogStory holds zero-one-or-many BlogStoryVersions
A User wrote zero-one-or-many BlogStoryVersions

Expository IDEF1X diagram

Consequently, in order to expound my suggestion by virtue of a graphical device, I have created a sample IDEF1X^a diagram that is derived from the business rules formulated above and other features that seem pertinent. It is shown in Figure 1:

Figure 1 - Blog Story Versions IDEF1X Diagram

Why are BlogStory and BlogStoryVersion conceptualized as two different entity types?

Because:

A BlogStoryVersion instance (i.e., a “past” one) always holds a value for an UpdatedDateTime property, while a BlogStory occurrence (i.e., a “present” one) never holds it.
Besides, the entities of those types are uniquely identified by the values of two distinct sets of properties: BlogStoryNumber (in the case of the BlogStory occurrences), and BlogStoryNumber plus CreatedDateTime (in the case of the BlogStoryVersion instances).

_{^a Integration Definition for Information Modeling (IDEF1X) is a highly recommendable data modeling technique that was established as a standard in December 1993 by the United States National Institute of Standards and Technology (NIST). It is based on the early theoretical material authored by the sole originator of the relational model, i.e., Dr. E. F. Codd; on the Entity-Relationship view of data, developed by Dr. P. P. Chen; and also on the Logical Database Design Technique, created by Robert G. Brown.}

Illustrative logical SQL-DDL layout

Then, based on the conceptual analysis previously presented, I declared the logical-level design below:

-- You should determine which are the most fitting 
-- data types and sizes for all your table columns 
-- depending on your business context characteristics.

-- Also you should make accurate tests to define the most
-- convenient index strategies at the physical level.

-- As one would expect, you are free to make use of 
-- your preferred (or required) naming conventions.    

CREATE TABLE UserProfile (
    UserId          INT      NOT NULL,
    FirstName       CHAR(30) NOT NULL,
    LastName        CHAR(30) NOT NULL,
    BirthDate       DATETIME NOT NULL,
    GenderCode      CHAR(3)  NOT NULL,
    UserName        CHAR(20) NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    --
    CONSTRAINT UserProfile_PK  PRIMARY KEY (UserId),
    CONSTRAINT UserProfile_AK1 UNIQUE ( -- Composite ALTERNATE KEY.
        FirstName,
        LastName,
        BirthDate,
        GenderCode
    ), 
    CONSTRAINT UserProfile_AK2 UNIQUE (UserName) -- ALTERNATE KEY.
);

CREATE TABLE BlogStory (
    BlogStoryNumber INT      NOT NULL,
    Title           CHAR(60) NOT NULL,
    Content         TEXT     NOT NULL,
    CoverImageName  CHAR(30) NOT NULL,
    IsActive        BIT(1)   NOT NULL,
    AuthorId        INT      NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    --
    CONSTRAINT BlogStory_PK              PRIMARY KEY (BlogStoryNumber),
    CONSTRAINT BlogStory_AK              UNIQUE      (Title), -- ALTERNATE KEY.
    CONSTRAINT BlogStoryToUserProfile_FK FOREIGN KEY (AuthorId)
        REFERENCES UserProfile (UserId)
);

CREATE TABLE BlogStoryVersion  (
    BlogStoryNumber INT      NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    Title           CHAR(60) NOT NULL,
    Content         TEXT     NOT NULL,
    CoverImageName  CHAR(30) NOT NULL,
    IsActive        BIT(1)   NOT NULL,
    AuthorId        INT      NOT NULL,
    UpdatedDateTime DATETIME NOT NULL,
    --
    CONSTRAINT BlogStoryVersion_PK              PRIMARY KEY (BlogStoryNumber, CreatedDateTime), -- Composite PK.
    CONSTRAINT BlogStoryVersionToBlogStory_FK   FOREIGN KEY (BlogStoryNumber)
        REFERENCES BlogStory (BlogStoryNumber),
    CONSTRAINT BlogStoryVersionToUserProfile_FK FOREIGN KEY (AuthorId)
        REFERENCES UserProfile (UserId),
    CONSTRAINT DatesSuccession_CK               CHECK       (UpdatedDateTime > CreatedDateTime) --Let us hope that MySQL will finally enforce CHECK constraints in a near future version.
);

Tested in this SQL Fiddle that runs on MySQL 5.6.

The BlogStory table

As you can see in the demo design, I have defined the BlogStory PRIMARY KEY (PK for brevity) column with the INT datatype. In this regard, you may like to fix a built-in automatic process that generates and assigns a numeric value for such a column in every row insertion. If you do not mind leaving gaps occasionally in this set of values, then you can employ the AUTO_INCREMENT attribute, commonly used in MySQL environments.

When entering all your individual BlogStory.CreatedDateTime data points, you can utilize the NOW() function, which returns the Date and Time values that are current in the database server at the exact INSERT operation instant. To me, this practice is decidedly more suitable and less prone to errors than the use of external routines.

Provided that, as discussed in (now-removed) comments, you want to avoid the possibility of maintaining BlogStory.Title duplicate values, you have to set up a UNIQUE constraint for this column. Due to the fact that a given Title may be shared by several (or even all of the) “past” BlogStoryVersions, then a UNIQUE constraint should not be established for the BlogStoryVersion.Title column.

I included the BlogStory.IsActive column of type BIT(1) (though a TINYINT may as well be used) in case you need to provide “soft” or “logical” DELETE functionality.

Details about the BlogStoryVersion table

On the other hand, the PK of the BlogStoryVersion table is composed of (a) BlogStoryNumber and (b) a column named CreatedDateTime that, of course, marks the precise instant in which a BlogStory row underwent an INSERT.

BlogStoryVersion.BlogStoryNumber, besides being part of the PK, is also constrained as a FOREIGN KEY (FK) that references BlogStory.BlogStoryNumber, a configuration that enforces referential integrity between the rows of these two tables. In this respect, implementing an automatic generation of a BlogStoryVersion.BlogStoryNumber is not necessary because, being set as a FK, the values INSERTed into this column must be “drawn from” the ones already enclosed in the related BlogStory.BlogStoryNumber counterpart.

The BlogStoryVersion.UpdatedDateTime column should retain, as expected, the point in time when a BlogStory row was modified and, as a consequence, added to the BlogStoryVersion table. Hence, you can use the NOW() function in this situation too.

The Interval comprehended between BlogStoryVersion.CreatedDateTime and BlogStoryVersion.UpdatedDateTime expresses the entire Period during which a BlogStory row was “present” or “current”.

Considerations for a Version column

It can be useful to think of BlogStoryVersion.CreatedDateTime as the column that holds the value that represents a particular “past” Version of a BlogStory. I deem this much more beneficial than a VersionId or VersionCode, since it is user-friendlier in the sense that people tend to be more familiar with time concepts. For instance, the blog authors or readers could refer to a BlogStoryVersion in a fashion similar to the following:

“I want to see the specific Version of the BlogStory identified by Number 1750 that was Created on 26 August 2015 at 9:30”.

The Author and Editor Roles: Data derivation and interpretation

With this approach, you can easily distinguish who holds the “original” AuthorId of a concrete BlogStory SELECTing the “earliest” Version of a certain BlogStoryId FROM the BlogStoryVersion table by virtue of applying the MIN() function to BlogStoryVersion.CreatedDateTime.

In this way, each BlogStoryVersion.AuthorId value contained in all the “later” or “succeeding” Versions rows indicate, naturally, the Author identifier of the respective Version at hand, but one can also say that such a value is, at the same time, denoting the Role played by the involved User as Editor of the “original” Version of a BlogStory.

Yes, a given AuthorId value may be shared by multiple BlogStoryVersion rows, but this is actually a piece of information that tells something very significant about each Version, so the repetition of said datum is not a problem.

The format of DATETIME columns

As for the DATETIME data type, yes, you are right, “MySQL retrieves and displays DATETIME values in 'YYYY-MM-DD HH:MM:SS' format”, but you can confidently enter the pertinent data in this manner, and when you have to perform a query you just have to make use of the built-in DATE and TIME functions in order to, among other things, show the concerning values in the appropriate format for your users. Or you could certainly carry out this kind of data formatting via your application programms(s) code.

Implications of BlogStory UPDATE operations

Every time that a BlogStory row suffers an UPDATE, you must ensure that the corresponding values that were “present” until the modification took place are then INSERTed into the BlogStoryVersion table. Thus, I highly suggest fulfilling these operations within a single ACID TRANSACTION to guarantee that they are treated as an indivisible Unit of Work. You may as well employ TRIGGERS, but they tend to make things untidy, so to speak.

Introducing a `VersionId` or `VersionCode` column

If you opt (because of business circumstances or personal preference) to incorporate a BlogStory.VersionId or BlogStory.VersionCode column to distinguish the BlogStoryVersions, you should ponder the following possibilities:

A VersionCode could be required to be UNIQUE in (i) the whole BlogStory table and also in (ii) BlogStoryVersion.

Therefore, you have to implement a carefully tested and totally reliable method in order to generate and assign each Code value.
Maybe, the VersionCode values could be repeated in different BlogStory rows, but never duplicated along with the same BlogStoryNumber. E.g., you could have:
- a BlogStoryNumber 3 - Version 83o7c5c and, simultaneously,
- a BlogStoryNumber 86 - Version 83o7c5c and
- a BlogStoryNumber 958 - Version 83o7c5c.

The later possibility opens another alternative:

Keeping a VersionNumber for the BlogStories, so there could be:
- BlogStoryNumber 23 - Versions 1, 2, 3…;
- BlogStoryNumber 650 - Versions 1, 2, 3…;
- BlogStoryNumber 2254 - Versions 1, 2, 3…;
- etc.

Holding “original” and “subsequent” versions in a single table

Although maintaining all the BlogStoryVersions in the same individual base table is possible, I suggest not to do it because you would be mixing two distinct (conceptual) types of facts, which thus has undesirable side-effects on

data constraints and manipulation (at the logical level), along with
the related processing and storage (at the physical tier).

But, on condition that you choose to follow that course of action, you can still take advantage of many of the ideas detailed above, e.g.:

a composite PK consisting of an INT column (BlogStoryNumber) and a DATETIME column (CreatedDateTime);
the usage of server functions in order to optimize the pertinent processes, and
the Author and Editor derivable Roles.

Seeing that, by proceeding with such an approach, a BlogStoryNumber value will be duplicated as soon as “newer” Versions are added, an option that and that you could evaluate (which is very alike to those mentioned in the previous section) is establishing a BlogStory PK composed of the columns BlogStoryNumber and VersionCode, in this manner you would be able to uniquely identify each Version of a BlogStory. And you can try with a combination of BlogStoryNumber and VersionNumber too.

Similar scenario

You may find my answer to this question of help, since I as well propose enabling temporal capabilities in the concerning database to deal with a comparable scenario.

Best Answer

Business rules

Illustrative logical SQL-DDL layout

Introducing a VersionId or VersionCode column

Holding “original” and “subsequent” versions in a single table

Similar scenario

Related Solutions

Mysql – Implementing something similar to “pointer arrays” with MySQL

Mysql – Implementing a Record Keeper/Rollback System for a MySQL Database

Related Question

Introducing a `VersionId` or `VersionCode` column