Sql-server – Incremental extract strategy using SSIS and CDC

change-data-capturedata-warehousesql serverssis

I'm looking for a strategy for extracting data from a SQL Server database that is configured for change data capture (CDC) using SSIS.

Most of the documentation I have seen around using CDC and SSIS are simple scenarios where there is one source table configured with CDS, a conditional split to determine the inserts, updates and deletes and then those corresponding actions.

My current extract consists of queries that join multiple tables together and then feeds that into the transform and load. Changes could happen in any of those tables, not just one.

One approach could be to just look for changes on the base table and do lookups or merges on the other tables. However that would miss changes on a supporting table if there is no corresponding change in the source table.

Another option would be to do individual extracts on all tables involved and then do the joining in the transformation layer. But again, there could be times that one or more tables isn't changed and isn't available for the join.

What I'm leaning towards is one of these 2 options:

Totally separate all tables into their own ETL that then do multiple inserts and updates on the destination table. For example, only process inserts for the source table of the query and process updates for all the supporting tables after the source table has been populated.
Store all previous extracts for a table in a versioned directory in the spirit of a "data lake" and then look up the related record from a previous version.

Is there a strategy that I'm not aware of?

I have used the CDC functions to extract data in the past. My plan was to try to use the SSIS CDC components themselves if possible.

My main reason for looking into this is to increase the frequency of our loads to be on an hourly basis instead of a nightly basis. Using CDC on my biggest tables would reduce the load, and reduce the time needed. It would still be ideal to only pull the data that has changed no matter the size.

We also have the ability to add a read only secondary to our AG to scale out reads similar to what Dave describes in his answer. However, the time taken to pull all the data out would still be the same even though the same load wouldn't be on the source system.

Best Answer

This appears to be your stumbling block in relation to most of the demos of SSIS CDC:

My current extract consists of queries that join multiple tables together and then feeds that into the transform and load. Changes could happen in any of those tables, not just one.

As such you're on the right track with your first proposed option. For your join issues - it seems that you either have data that changes in tandem or a data set that shouldn't implement CDC. You should probably reduce the number of CDC tables and do a full load from others. At the least you could come up with a custom incremental that makes sense for your JOIN requirements while also reducing your data set size. Something like:

SELECT *
FROM InfoTable
WHERE Date >= '01-01-2016'

The typical pattern for data warehouse ETL is to move table by table from source to a Staging database, which replicates the table structure of the source systems nearly 1 to 1 (only for tables you actually use and skip things like IDENTITY which will fail your inserts/transfer). Following this pattern should solve at least part of the CDC issue.

From Staging to DataWarehouse you should have your joins, business logic, clean-up, etc. in the ETL. One of the benefits of this pattern, beside simplifying incremental source system pulls, is the reduced load on LoB/OLTP systems. You get in and out from the source system as fast as possible and leave the majority of the ETL burden on the SSIS/DW server(s).

You could setup a variety of incremental load patterns on all your tables. The importance of moving ETL logic/resource use out of source systems becomes even more important if you're running that frequently. If you're trying to meet an hour load window there may be other options beside just incremental loads... even if your time is consumed by data transfer (e.g. network settings).

1. Determine which tables are getting updated the most.

Try running the extractor for a few hours and let the THL grow. Then, you can count the number of changes logged for each table using a command like:

./bin/thl list | gawk -f thl_summary.awk | sort -n -r

Where thl_summary.awk is:

#!/bin/gawk
# Sum up rows from a `bin/thl list` output
BEGIN {
    OFS="\t"
}
match($0, /- TABLE = (.*)/, a) {
    table=a[1]
}
match($0, /- ROW# = (.*)/, a) {
    rows[table]++
}
END {
    for (k in rows) {
        print rows[k], k
    }
}

2. Replicate only those tables and columns that are necessary.

Looking at the list of tables, ask yourself if you can exclude any of them.

Exclude CLOB columns.
- They aren't really supported by synchronous CDC.
- Even if you specify a subset of the columns in DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE, each and every change will be logged to the change table¹. But by excluding CLOB columns, we can lessen the load of costly writes and reads.
Exclude all tables that are not really needed on the downstream database.
- If necessary, add a fallback table with a subset of the columns that you really need, even if it means denormalizing a bit.
- The way Tungsten configures CDC, a single update to a source row generates two rows in the corresponding change table². These extra updates can add up if you have a frequently updated table.

Once you've determined which tables and columns to replicate, you can specify them in a file named tungsten.tables:

aSubsetOfTheTable<TAB>column1,...,columnN
wholeTable

Then, you need to re-run the setupCDC.sh script with setupCDC.conf:

specific_tables=1

¹ https://community.oracle.com/thread/2288254

² Updated rows will generate a record with operation$='UU' ... and all the old values, and another record with operation$='UN' or 'UL' with new values. (source)

Best Answer

Related Solutions

Sql-server – Delayed availability of historical data when using CDC

Which parameters and metrics to look at when CDC based replication is causing high load

1. Determine which tables are getting updated the most.

2. Replicate only those tables and columns that are necessary.

Related Question