Your Altered_Files design is absolutely fine. In fact, I would award bonus points for separating the files per se from the hierarchy. @chillworld's suggestion is workable and very common, but yours is better because of the NULL issues you mention. I can't think of any update which is better in one schema or the other. Change a single row - either the file data or the file's relationship - and you have a single update on a single table. No risk to integrity there. Altered_Files contains only primary keys (PKs) and pointers to PKs. By definition these should never change. On the off chance you do have to change one then, yes, you have to be very, very careful to maintain consistency. For INSERTs and DELETEs, wrap the statements in a transaction. This is precisely why transactions exist.
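A minimal sketch of such a transaction; the table and column names here are assumptions for illustration, not taken from your schema:

```sql
BEGIN;

-- insert the file itself
INSERT INTO Files (file_id, file_name)
VALUES (4, 'image4.png');

-- insert its place in the hierarchy
INSERT INTO Altered_Files (file_id, parent_file_id)
VALUES (4, 3);

COMMIT;  -- both rows become visible together; on error, ROLLBACK discards both
```

If either statement fails, rolling back leaves no orphaned row in either table.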
I would suggest you store the full directory path for an altered image, not just its new sub-folder. What happens if you derive image 4 from image 3, from image 2, from image 1? Will concat('-') handle that? Will continually reconstructing image 4's ancestry be a burden on the system? You know your data, workflow and hardware best. If the volumes are small this may not be an issue.
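If you do decide to reconstruct the path on demand rather than store it, a recursive CTE can walk the ancestry. This is a sketch only, against an assumed Altered_Files(file_id, parent_file_id) layout, in Postgres syntax (MySQL 8+ supports recursive CTEs too, with GROUP_CONCAT in place of string_agg):

```sql
WITH RECURSIVE ancestry AS (
    SELECT file_id, parent_file_id, 1 AS depth
    FROM   Altered_Files
    WHERE  file_id = 4                 -- start at image 4
    UNION ALL
    SELECT af.file_id, af.parent_file_id, a.depth + 1
    FROM   Altered_Files af
    JOIN   ancestry a ON af.file_id = a.parent_file_id
)
SELECT string_agg(file_id::text, '-' ORDER BY depth DESC) AS path
FROM   ancestry;   -- e.g. '1-2-3-4' for a four-deep chain
```

Each level of derivation costs one more join, which is the burden mentioned above.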
The file-in-database-or-filesystem question is an age-old one. SQL Server has functionality called FILESTREAM where the image passes through the database on the way to the file system. It is held on disk as a JPG or PNG or whatever and can be read by normal applications. The database retains ownership of it and its related data. The best of both worlds. I don't think MySQL has anything similar, however. You may want to grant other applications read access to the image file store and reserve write access for the DB.
1st case
You seem to forget the valid_during range. As your third case suggests, there can be multiple entries per (rec_id, val), so you must select the right one:
UPDATE master m
SET valid_on = f_array_sort(m.valid_on || u.valid_on) -- sorted array, see below
FROM updates u
WHERE m.rec_id = u.rec_id
AND m.valid_during @> u.valid_on -- additional check
AND m.val = u.val
AND NOT m.valid_on @> ARRAY[u.valid_on];
I assume the whole possible date range is always covered for each existing rec_id and that valid_during does not overlap per rec_id, or you'd have to do more.
After installing the additional module btree_gist, add an exclusion constraint to rule out overlapping date ranges if you don't have one yet:
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for rec_id WITH =

ALTER TABLE master ADD CONSTRAINT master_rec_id_valid_during_excl
EXCLUDE USING gist (rec_id WITH =, valid_during WITH &&);  -- disallow overlap
The GiST index this is implemented with is also a perfect match for the query.
2nd / 3rd case
Assuming that every date range starts with the smallest date in the (now sorted!) array: lower(m.valid_during) = m.valid_on[1]. I would enforce that with a CHECK constraint.
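Such a CHECK constraint could look like the following; the constraint name is my choice, and both lower() and array subscripting are immutable, so this is legal in a CHECK clause:

```sql
ALTER TABLE master
ADD CONSTRAINT valid_during_starts_at_first_elem
CHECK (lower(valid_during) = valid_on[1]);
```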
Here we need to create one or two new rows:
In the 2nd case it is enough to shrink the range of the old row and insert one new row.
In the 3rd case we update the old row with the left half of the array and range, insert the new row, and finally insert another row with the right half of the array and range.
Helper functions
To keep it simple I introduce a new constraint: every array is sorted. Use this helper function:
-- sort array
CREATE OR REPLACE FUNCTION f_array_sort(anyarray)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT unnest($1) ORDER BY 1)$$;
Your helper function arraymin() is no longer needed, but it could be simplified to:
CREATE OR REPLACE FUNCTION f_array_min(anyarray)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS
$$SELECT min(a) FROM unnest($1) a$$;
Two more to get the left and right half of an array split at a given element:
-- split left array at given element
CREATE OR REPLACE FUNCTION f_array_left(anyarray, anyelement)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a < $2 ORDER BY 1)$$;
-- split right array at given element
CREATE OR REPLACE FUNCTION f_array_right(anyarray, anyelement)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a >= $2 ORDER BY 1)$$;
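A quick sanity check of the three helpers, with integer arrays chosen purely for illustration:

```sql
SELECT f_array_sort('{3,1,2}'::int[]);      -- {1,2,3}
SELECT f_array_left('{1,2,3,4}'::int[], 3); -- {1,2}   : elements strictly below 3
SELECT f_array_right('{1,2,3,4}'::int[], 3); -- {3,4}  : elements from 3 upward
```

Note the asymmetry: the split element itself lands in the right half, which is what the splitting query below relies on.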
Query
This does all the rest:
WITH u AS ( -- identify candidates
SELECT m.id, rec_id, m.val, m.valid_on, m.valid_during
, u.val AS u_val, u.valid_on AS u_valid_on
FROM master m
JOIN updates u USING (rec_id)
WHERE m.val <> u.val
AND m.valid_during @> u.valid_on
FOR UPDATE -- lock for update
)
, upd1 AS ( -- case 2: no overlap, no split
UPDATE master m -- shrink old row
SET valid_during = daterange(lower(u.valid_during), u.u_valid_on)
FROM u
WHERE u.id = m.id
AND u.u_valid_on > m.valid_on[array_upper(m.valid_on, 1)]
RETURNING m.id
)
, ins1 AS ( -- insert new row
INSERT INTO master (rec_id, val, valid_on, valid_during)
SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
, daterange(u.u_valid_on, upper(u.valid_during))
FROM upd1
JOIN u USING (id)
)
, upd2 AS ( -- case 3: overlap, need to split row
UPDATE master m -- shrink to first half
SET valid_during = daterange(lower(u.valid_during), u.u_valid_on)
, valid_on = f_array_left(u.valid_on, u.u_valid_on)
FROM u
LEFT JOIN upd1 USING (id)
WHERE upd1.id IS NULL -- all others
AND u.id = m.id
RETURNING m.id, f_array_right(u.valid_on, u.u_valid_on) AS arr_right
)
INSERT INTO master (rec_id, val, valid_on, valid_during)
-- new row
SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
, daterange(u.u_valid_on, upd2.arr_right[1])
FROM upd2
JOIN u USING (id)
UNION ALL -- second half of old row
SELECT u.rec_id, u.val, upd2.arr_right
, daterange(upd2.arr_right[1], upper(u.valid_during))
FROM upd2
JOIN u USING (id);
Notes
You need to understand the concept of data-modifying CTEs (writable CTEs) before you touch this. Judging from the code you provided, you know your way around Postgres.
FOR UPDATE is there to avoid race conditions with concurrent write access. If you are the only user writing to the tables, you don't need it.
I took a piece of paper and drew a timeline so as not to get lost in all of this.
Each row is only updated / inserted once, and operations are simple and roughly optimized. No expensive window functions. This should perform well. Much faster than your previous approach in any case.
It would be a bit less confusing if you used distinct column names for u.valid_on and m.valid_on, which are related but different things.
I compute the right half of the split array in the RETURNING clause of CTE upd2: f_array_right(u.valid_on, u.u_valid_on) AS arr_right, because I need it several times in the next step. This is a (legal) trick to save one more CTE.
As for solutions that don't involve unnesting the master table: you have to unnest the array valid_on either way to split it, at least as long as it's not sorted. Also, your helper function arraymin() is already unnesting it anyway.
Best Answer
We use Change Data Capture. SQL Server essentially creates journal tables and then automatically reads from the transaction log and fills in the journal tables with information about DML operations on the original table.
To save space, we compress the journal tables. This means your journal tables are stored in the same database, but if you are having space issues, you could offload them to a different file/filegroup and store that on a different disk.
The journal tables are in a separate schema so you could exclude that entire schema from your short-term backups and only include them in your long-term backups.
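Setting this up is two stored-procedure calls plus an optional compression rebuild. The schema, table, filegroup, and capture-instance names below are placeholders for illustration:

```sql
-- Enable CDC at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Track DML on one table; the change table lands in the cdc schema
EXEC sys.sp_cdc_enable_table
    @source_schema  = N'dbo',
    @source_name    = N'Orders',
    @role_name      = NULL,         -- no gating role for reading the journal
    @filegroup_name = N'CDC_FG';    -- optional: place the change table on its own filegroup/disk

-- Compress the journal; change tables follow the cdc.<capture_instance>_CT pattern
ALTER TABLE cdc.dbo_Orders_CT REBUILD WITH (DATA_COMPRESSION = PAGE);
```

The @filegroup_name argument is what makes the space offloading mentioned above possible, and keeping the change tables in the cdc schema is what lets you split them out of short-term backups.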