SQL Server – Duplicating some rows of a data flow in SSIS

etl, sql-server-2008, ssis

I have a data flow to migrate data from an old database to a new one. The old design had all the data and historical information (changes) stored in a single table with a "version" (incrementing integer) against the row.

The new design has two tables, one for the "current" state of the data and an audit (or history) table which records changes using a trigger. Therefore only one row exists for the "current" data and there are many history rows.

In my SSIS package I am using the following components to copy the current data to one table but then send all the data to the audit table.

[Image: SSIS data flow]

The Multicast splits the stream of data, and the Conditional Split identifies the "current" row and sends it to the Order table (the table is not actually called Order, before anyone comments about using a reserved word as a table name).

I created this flow because I could not see a way to use the Conditional Split to send all the data to the Audit destination and only the current row to the other.

I'm assuming that creating all the duplicate data and then discarding it via a Trash Destination is not very efficient, and since I have about 52 million rows to migrate I'm worried the transformation will take days.

Is there a better (more efficient) way to achieve the data split?

Note on data: I have applied a ROW_NUMBER() to the data, which allows me to identify the "current" row as number 1; all the rows, including the "current" one, need to go to the Audit table destination.
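For reference, a minimal sketch of that numbering step (the column names `OrderId` and `Version` are hypothetical; it assumes the highest version is the "current" row):

```sql
-- Sketch only: OrderId and Version are hypothetical column names.
-- RowNum = 1 marks the "current" (latest) version of each order.
SELECT
    o.*,
    ROW_NUMBER() OVER (
        PARTITION BY o.OrderId
        ORDER BY o.Version DESC
    ) AS RowNum
FROM dbo.OldOrderTable AS o;
```

The Conditional Split condition for the Order output is then simply `RowNum == 1`.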

EDIT:
I have found an alternative to the Multicast and Conditional Split, suggested in this blog post from SSIS Junkie: Multiple outputs from a synchronous script transform

It uses a Script Component to send data to one or more outputs. I'm trying that method to see if it's any faster, but after seeing Kenneth's answer and his suggestion to remove the Trash Destination, I'm not sure it will be.

Best Answer

I see no glaring issue with that data flow. I always suggest doing as much work as possible in your source queries, so if you can create a dataset at the outset that allows you to populate both tables via a simple split, it will certainly use less memory. But things like that are not always possible, depending on the data source and data format.
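For example, if both databases happen to live on the same instance, the whole migration can be pushed into set-based SQL and the "split" becomes two plain INSERT … SELECT statements. A sketch only; every object and column name here is hypothetical:

```sql
-- Sketch: all object and column names (OldDb, NewDb, OldOrderTable,
-- OrderAudit, OrderId, Version, Amount) are hypothetical.

-- 1) Every row (all versions) goes to the audit table.
INSERT INTO NewDb.dbo.OrderAudit (OrderId, Version, Amount)
SELECT OrderId, Version, Amount
FROM OldDb.dbo.OldOrderTable;

-- 2) Only the latest version of each order goes to the "current" table.
INSERT INTO NewDb.dbo.[Order] (OrderId, Version, Amount)
SELECT OrderId, Version, Amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderId
                              ORDER BY Version DESC) AS rn
    FROM OldDb.dbo.OldOrderTable
) AS s
WHERE s.rn = 1;
```

Note that, depending on how the audit trigger on the new Order table is written, you may need to disable it during the load so the bulk insert doesn't generate duplicate audit rows.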

Also, the Trash Destination is useful for development and debugging but does no good in production. Remove it. Let the 'Unwanted Data' output terminate at the split; SSIS can figure out the rest.

As long as you avoid blocking components (Sort, Aggregate, Merge, etc.) there is no reason this process should take days. I frequently process many millions of rows in SSIS without issue. SSIS is only as slow as the person who designed the process.

Is it currently having performance issues?