Sql-server – CDC Capture Instance or State to ensure referential integrity

change-data-capturesql serversql-server-2012ssis-2012

I'm implementing an SSIS/CDC based data warehouse solution and have a question relating to referential integrity on the destination database.

As an example, I have INVENTORY_TRANSACTION in one table and PART in another table each with their own SSIS package for incremental loads. I don't want to have a situation where I have INVENTORY_TRANSACTIONS in the data warehouse that don't have an associated parent PART record.

I've read quite a bit on the subject but I'm still unclear.

Should I be using the same capture_instance on the tables that have related data so they all target the same LSN chain? Should I be using the same State name? Or, am I missing something entirely?

Best Answer

From Microsoft docs. Guess I just had to find the right doc... https://docs.microsoft.com/en-us/sql/integration-services/data-flow/cdc-flow-components#grouping-tables-for-cdc-processing

Grouping Tables for CDC Processing

Database projects range in size from several tables to many thousands of tables. When designing initial load and CDC packages, it is beneficial to group tables in much smaller groups for easier management and efficiency. This section lists various considerations that impact the sorting of tables into small groups, where the tables in each are initially loaded and then updated as a group.

The CDC patterns supported by the CDC components assume that this grouping is already determined. Each group defines a separate CDC context that is maintained separately from other groups. For each group, initial-load and trickle-feed update packages are created. Trickle-feed updates are scheduled for periodic runs based on the rate of change processing constraints (for example, CPU and IO consumption, impact on other systems) and the desired latency.

Tables are grouped based on the following considerations:

According to the target database. All tables that are written to different target databases or undergo different processing should be assigned to different CDC groups.

Tables that are related with referential integrity constraints should be assigned to the same group to avoid referential integrity problems at the target.

Tables for which higher latency can be tolerated can be grouped so they can be processed less frequently and reduce overall system load.

Tables for which there is a higher rate of change should be in smaller groups, and tables with a low rate of change can be grouped in larger groups.

The following two packages are created for each CDC group:

An Initial Load package, which reads the entire range of data from the source tables and applies it to the target tables.

A trickle-feed update package that reads changes made to the source tables and applies the changes to the target tables. This package should be executed on a regularly scheduled basis.

CDC State

Each CDC group has a state associated with it, which is represented by a string with a specific format. For more information, see CDC Control Task. The following table shows the possible CDC state values.

StateDescription0-(INITIAL)The state that exists before any packages are run on the current CDC group. This is also the state when the CDC state is empty.

For more information about CDC Control task operations, see CDC Control Task.1-ILSTART (Initial-Load-Started)This is the state that exists when the initial load package starts. This occurs after the MarkInitialLoadStartoperation call to the CDC Control task.

For more information about CDC Control task operations, see CDC Control Task.2- ILEND (Initial-Load-Ended)This is the state that exists when the initial load package ends successfully. This occurs after the MarkInitialLoadEnd operation call to the CDC Control task.

For more information about CDC Control task operations, see CDC Control Task.3-ILUPDATE (Initial Load Update)This is the state that exists after the first run of the Update package after the initial load while still processing the initial processing range. This occurs after the GetProcessingRangeoperation call to the CDC control task.

If using the _$reprocessing column, it is set to 1 to indicate that the package may be reprocessing rows already at the target.

For more information about CDC Control task operations, see CDC Control Task.4-TFEND (Trickle-Feed-Update-Ended)This is the state expected for regular CDC runs. It indicates that the previous run completed successfully and that a new run with a new processing range can be started.5-TFSTART (Trickle-Feed-Update-Started)This is the state that exists on subsequent runs of the Update package after the GetProcessingRangeoperation call to the CDC control task.

This indicates that a regular CDC run is started, but is not finished or has not yet finished, cleanly (MarkProcessedRange).

For more information about CDC Control task operations, see CDC Control Task.6-TFREDO (Reprocessing-Trickle-Feed-Updates)This is the state on a GetProcessingRangethat occurs after TFSTART. This indicates that the previous run did not complete successfully.

If using the __$reprocessing column, it is set to 1 to indicate that the package may be reprocessing rows already at the target.7-ERRORThe CDC group is in an ERROR state.

Here is the state diagram for the CDC components. An ERROR state is reached when a state is reached that is not expected. The expected states are illustrated in the following diagram. However the diagram does not show the ERROR state.

For example, at the end of an initial load package, when trying to set the state to ILEND, if the state is TFSTART then the CDC group is in an error state and the Trickle-Feed Update package does not run (the Initial Load package does run).

Once the Initial Load package runs successfully, the Trickle-Feed Update package runs repeatedly under a predetermined schedule to process changes to the source tables. Each run of the Trickle-Feed Update package is a CDC run.

Related Question