Sql-server – When to use CDC to track history

change-data-capture, data-versioning, sql-server

SQL Server Change Data Capture (CDC) is a feature that reads change records from the SQL Server transaction log and stores them in special change tables.

Through special table-valued functions (TVFs), it then lets the user query this data, returning either all changes to a specific table or only the net changes that occurred within a specific time range.
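As a rough sketch of what such a query can look like: the capture instance name dbo_Users and the 24-hour window below are assumptions for illustration, and the net-changes function only exists if the instance was enabled with net-changes support.

```sql
-- Assumption: a capture instance named dbo_Users exists and was enabled
-- with @supports_net_changes = 1; the time window is arbitrary.
DECLARE @from_time DATETIME = DATEADD(DAY, -1, GETDATE()),
        @to_time   DATETIME = GETDATE(),
        @from_lsn  BINARY(10),
        @to_lsn    BINARY(10);

-- Translate the time window into the LSN range the CDC functions expect.
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @from_time);
SET @to_lsn   = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @to_time);

-- Every change in the window: one row per insert, delete, or update.
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Users(@from_lsn, @to_lsn, N'all');

-- Only the net effect per source row over the same window.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Users(@from_lsn, @to_lsn, N'all');
```

If the window starts before the capture instance's minimum LSN, or the time mapping returns NULL, the functions raise an error, so real code would need to check the range first.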

CDC has certain advantages:

It can be configured to only track certain tables or columns.
It is able to handle model changes to a certain degree.
It does not affect performance as heavily as triggers because it works with the transaction logs.
It is easily enabled/disabled and does not require additional columns on the table that should be tracked.
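To illustrate the first and last points, here is a minimal sketch of enabling CDC for selected columns only. The table dbo.Users, its columns, and the capture instance name are hypothetical; the disable call is shown only for completeness.

```sql
-- Assumption: a table dbo.Users with columns ID, Name, Email, where ID
-- is the primary key. All object names here are illustrative.

-- Enable CDC once per database (requires sysadmin).
EXEC sys.sp_cdc_enable_db;

-- Track only selected columns of one table.
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'Users',
     @role_name            = NULL,              -- no gating role
     @capture_instance     = N'dbo_Users',
     @captured_column_list = N'ID, Name, Email',
     @supports_net_changes = 1;                 -- needs a primary key or unique index

-- Turning tracking off again drops the capture instance and its change table.
EXEC sys.sp_cdc_disable_table
     @source_schema    = N'dbo',
     @source_name      = N'Users',
     @capture_instance = N'dbo_Users';
```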

It also has some disadvantages:

The amount of history data can become huge fast.
You are not able to track who made the changes (at least not for deletes).
The history data takes some time to catch up, because it is based on the transaction logs.
It depends on the SQL Server Agent. If the Agent is not running or crashes, no history is being tracked.
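On the first point, the size of the change tables is bounded by the cleanup job's retention period, which defaults to 4320 minutes (three days). A hedged sketch of shortening it, with an arbitrary example value:

```sql
-- Keep one day of change data instead of the default three days.
-- Run in the CDC-enabled database; 1440 is an illustrative value.
EXEC sys.sp_cdc_change_job
     @job_type  = N'cleanup',
     @retention = 1440;   -- minutes

-- Show the current capture and cleanup job settings.
EXEC sys.sp_cdc_help_jobs;
```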

I have read quite a lot about CDC and while I know now how to use it, I am still not sure if it is the right tool for me.

For which tasks/scenarios is CDC the right tool? (e.g. Allowing users to restore a data object to a certain point in time? Auditing? Showing the complete history of data?)
When should you rather not use CDC, but resort to a custom trigger-based solution?
Is it ok to use CDC in an operational database and make use of the CDC data within an operational application? (e.g. showing it to the end user) Or is this clearly a misuse of this feature?

I commonly hear that CDC is an audit tool, but isn't that what SQL Server Audit is for? Are they both different tools for the same task? Or can CDC be used for other things?

My current scenario is that I am asked to build a reliable data framework which is supposed to be the basis for multiple future applications. The exact requirements are blurry, but one is that it should be able to track data history and restore older entries together with all related data from other tables. I am evaluating CDC right now as an option, but am uncertain if this is the way to go, because I can't really find any recommended use cases.

While I appreciate advice for my specific scenario, answers should give general advice about when or when not to use Change Data Capture.

Best Answer

Firstly,

Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server.

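So that may decide it for you if any of your customers will not have Enterprise edition, or if you don't yet know whether you will be using Enterprise edition. (As the spec includes "multiple future applications", this may be a real issue for you.) Note that this restriction applies to older versions: from SQL Server 2016 SP1 onward, CDC is also available in Standard edition.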

Unlike triggers, it is not real time, which is both an advantage and a disadvantage. Using triggers always slows down an update.

I worked on one system where we used triggers (generated by CodeSmith). As well as tracking all the changes to the records, we also linked the changes together in a “history” table that included the module of the application that made the change and the UI item the user used to make it.

However, you may be best off solving this at the application level, say by writing all updates to a message queue that can then be replayed to recreate the database as of any given point in time. See Temporal Patterns on Martin Fowler's blog for a good overview of the options.

Related Solutions

Sql-server – How to retrieve the Capture Instance for a given table

Input (hopefully stored procedure parameters):

DECLARE @table1 NVARCHAR(513) = N'dbo.Users',
        @table2 NVARCHAR(513) = N'dbo.DataObjects',
        @ID     INT = 10;

Code:

DECLARE @InstanceName1 NVARCHAR(513),
        @InstanceName2 NVARCHAR(513),
        @Begin_LSN     BINARY(10),
        @End_LSN       BINARY(10);

SELECT @InstanceName1 = c.capture_instance
  FROM cdc.change_tables AS c
  INNER JOIN sys.tables AS t
  ON c.[source_object_id] = t.[object_id]
  INNER JOIN sys.schemas AS s
  ON t.[schema_id] = s.[schema_id]
  WHERE t.name = PARSENAME(@table1,1)
    AND s.name = PARSENAME(@table1,2);

SELECT @InstanceName2 = c.capture_instance
  FROM cdc.change_tables AS c
  INNER JOIN sys.tables AS t
  ON c.[source_object_id] = t.[object_id]
  INNER JOIN sys.schemas AS s
  ON t.[schema_id] = s.[schema_id]
  WHERE t.name = PARSENAME(@table2,1)
    AND s.name = PARSENAME(@table2,2);

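-- Start at the later of the two minimum LSNs: the change functions raise
-- an error if the range begins before a capture instance's minimum LSN.
SELECT @Begin_LSN = CASE
         WHEN sys.fn_cdc_get_min_lsn(@InstanceName1) > sys.fn_cdc_get_min_lsn(@InstanceName2)
         THEN sys.fn_cdc_get_min_lsn(@InstanceName1)
         ELSE sys.fn_cdc_get_min_lsn(@InstanceName2)
       END,
       @End_LSN   = sys.fn_cdc_get_max_lsn();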

DECLARE @sql NVARCHAR(MAX);

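-- QUOTENAME guards the dynamically built function names.
SET @sql = N'SELECT * 
  FROM cdc.' + QUOTENAME(N'fn_cdc_get_all_changes_' + @InstanceName2)
  + N'(@b, @l, N''all'') AS a 
  INNER JOIN cdc.' + QUOTENAME(N'fn_cdc_get_all_changes_' + @InstanceName1)
  + N'(@b, @l, N''all'') AS b 
  ON a.__$start_lsn = b.__$start_lsn
  WHERE a.ID = @ID;';
  -- The ID filter is a guess; adjust it to the actual key column.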

EXEC sp_executesql @sql, 
  N'@b BINARY(10), @l BINARY(10), @ID INT',
  @Begin_LSN, @End_LSN, @ID;

Sql-server – Delayed availability of historical data when using CDC

CDC relies on a SQL Server Agent job to capture information from the transaction log. You can customize the job parameters stored in msdb.dbo.cdc_jobs (the supported way to change them is sys.sp_cdc_change_job). There are multiple properties, as described in this Books Online topic:

maxtrans
maxscans
continuous
pollinginterval

That doc has some information about these parameters, as does this white paper.
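For illustration, a cautious sketch of changing one of these parameters through the documented procedures. The value 2 is only an example (the default polling interval is 5 seconds), and the statements must run in the CDC-enabled database:

```sql
-- Illustrative value only: poll the log every 2 seconds instead of
-- the default 5.
EXEC sys.sp_cdc_change_job
     @job_type        = N'capture',
     @pollinginterval = 2;

-- Restart the capture job so the new setting takes effect.
EXEC sys.sp_cdc_stop_job  @job_type = N'capture';
EXEC sys.sp_cdc_start_job @job_type = N'capture';
```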

But really, CDC is not meant to be a real-time system, and you should use caution adjusting these parameters too much to make it so. Most people acknowledge that you need to wait to let CDC catch up (see here and here). If 5-10 seconds is not fast enough for you, you might be using the wrong technology.

As for the separate transaction question, this is something you could test, no?
