Building a branched versioning model for relational databases

data-versioningdatabase-designversion control

I am database designer and at my current project I'm implementing versioning capabilities required to concurrently edit rows of data in RDBMS. The project requirements says, that data editing sessions can go on for several hours or days until performing commit. Also, conflicts are arising during simultaneous modifying of the same data by different users should be handled with possibility of manual and semi-automatic resolution. In other words, desired editing workflow is similar to one used in document-oriented version control systems, such as SVN or Git. Therefore, traditional OLTP approaches and conflict resolution strategies (MVCC, optimistic/pessimistic locks) doesn't satisfy my constraints. I have done some observation of existing tools, that offer possibilities for branched version history and multiversion workflow:

SQL:2011 doesn't solve my problem, as it offers support for "linear" history of edits, not branched I'm looking for. Solutions from ESRI and Oracle are good candidates, but I'm disappointed that both have vendor-specific interfaces for manipulating versions. It seems that at this moment nobody can offer industry standard solution for branched versioning of relational data (as SQL:2011 does for temporal tables and linear version history). As a newcoming database researcher, i want to understand:

  • is relational database community interested in developing standard models of branched data versioning and will any contribution or research in this area be valuable? (for example, standartization as it was done for temporal features in SQL2011 in the form of language improvements)
  • do developers and database designers lack for database-independent open-source middleware (similar to ArcSDE), that offers support for branched version management of relational data or it would be better to introduce such features in RDBMS itself?

I think I can try to dig deeper and propose some standard model or sublanguage to deal with Git-like versioning, but i don't know where to start.

Best Answer

IMO this problem lacks sufficient generality to build standards or reusable solutions. It's really just a problem you solve with data modeling,

Something along the lines of:

CREATE TABLE DOC
(
  DOC_ID INT PRIMARY KEY,
  CURRENT_VERSION_ID INT,
  MODIFIED_AT DATETIME,
  MODIFIED_BY INT,
  CONSTRAINT FK_CURRENT_VERSION 
    FOREIGN KEY (DOC_ID, CURRENT_VERSION)
    REFERENCES DOC_VERSION(DOC_ID,VERSION_ID)

)


CREATE TABLE DOC_VERSION
(
  DOC_ID INT REFERENCES DOC,
  VERSION_ID INT,
  VERSION_OWNER INT,
  SOURCE_VERSION_ID INT,
  DOC_CONTENT NVARCHAR(MAX),
  CONSTRAINT PK_DOC_VERSION PRIMARY KEY (DOC_ID,VERSION_ID)
)

If the content being versioned is not stored as a blob, but is in separate tables, the pattern still holds. All the tables that store the versioned data need the VERSION_ID.

It would certainly be an interesting research project to explore all the different ways this can be done, and discuss options for conflict resolution, merging changes and sketch out what a general solution might look like.