SSIS – Improving Multicast Performance

Tags: etl, ssis

I'm wondering how Multicast actually works. Here's my setup: I have one source flat file (CSV) of about 100k rows and maybe 250-300 columns, and I'm multicasting that file into 7 one-to-one tables. Only the primary key column is duplicated among these tables. This setup seemed to work fine, but now it seems we're running into memory issues. Is multicasting that data to 7 destinations too much? Any thoughts on memory usage? I don't really care how fast this runs; I'm just looking to reduce memory usage.

One alternative I was thinking of was creating 7 different connection managers for the same file and running everything sequentially. Would that be better? Thanks!

Best Answer

SSIS gets its power by being an in-memory transformation engine. The base unit of work within a data flow task is the buffer. If you ever wonder why SSIS is so persnickety about data types, it's because it calculates the cost of a row and then allocates memory for N rows. All* the downstream components use the same memory address to do their part of the ETL, which allows for parallelization. This means that the cost of a Multicast with 1 output is the same as a Multicast with 10 outputs. It simply allows multiple components to consume the data.
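A back-of-the-envelope sketch of that buffer math, in Python. The 10 MB `DefaultBufferSize` and 10,000-row `DefaultBufferMaxRows` are SSIS's documented defaults; the 275-column, 20-bytes-per-column row is an illustrative assumption roughly matching the question, not a measured value.

```python
# Rough model of SSIS buffer sizing (a sketch, not the actual engine).
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024   # DefaultBufferSize default: 10 MB
DEFAULT_BUFFER_MAX_ROWS = 10_000         # DefaultBufferMaxRows default

def rows_per_buffer(row_bytes: int) -> int:
    """A buffer holds DefaultBufferMaxRows rows, or fewer if the row is wide."""
    return min(DEFAULT_BUFFER_MAX_ROWS, DEFAULT_BUFFER_SIZE // row_bytes)

def multicast_memory(row_bytes: int, n_outputs: int) -> int:
    """All synchronous outputs share the same buffer, so n_outputs is
    intentionally unused: 1 output costs the same as 7."""
    return rows_per_buffer(row_bytes) * row_bytes

# A wide flat-file row: ~275 columns averaging 20 bytes each = 5,500 bytes/row.
wide_row = 275 * 20
print(rows_per_buffer(wide_row))  # only ~1,906 such rows fit in one 10 MB buffer
print(multicast_memory(wide_row, 1) == multicast_memory(wide_row, 7))  # True
```

The point of the sketch: widening the Multicast fan-out doesn't multiply buffer memory, but widening the row shrinks how many rows each buffer can carry.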

Where I'm not 100% certain is a situation where, pre-multicast, I have a column called SSN. In output 1 from the Multicast I ROT-5 the digits in place and write to a destination, while output 2 carries the unaltered version. There has to be an additional memory cost associated with that, but I assume it's just the space required for the duplicated column.
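For clarity, the ROT-5 masking mentioned above is just a shift of each decimal digit by 5; since 5 is its own inverse mod 10, applying it twice restores the original. A minimal Python illustration (the SSN value is made up):

```python
def rot5(text: str) -> str:
    """ROT-5 each decimal digit in place; non-digits pass through unchanged."""
    return "".join(str((int(ch) + 5) % 10) if ch.isdigit() else ch
                   for ch in text)

masked = rot5("123-45-6789")
print(masked)                # 678-90-1234
print(rot5(masked))          # applying ROT-5 again gives back 123-45-6789
```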

*All the downstream components use the same memory address until you introduce asynchronous components (Sort, Aggregate, etc.). When one of those hits the data flow, that buffer ends and the data is copied from memory space 1 into memory space 2. That sounds good from the perspective of "I just reduced the number of columns from 100 to 10," but you've fractured your total memory space between "before the async component" and "after it." Instead of being able to run 100 buffers at once, you're at 30 before and 70 after, and the memory copy in between is an expensive operation.

If you want to reduce memory usage, architect your packages to use less ;)

  • You can use less by avoiding asynchronous transformations.
  • Only pull in the columns you need.
  • If you have a Lookup, write a query that explicitly enumerates the needed columns instead of picking the table from the drop-down list.
  • If you have a Lookup, also look at restricting the rows returned. If you only need to look up current values, add WHERE IsCurrent = CAST(1 AS bit); to the query above.
  • Watch your data types. When importing from a file, don't be lazy and read everything as strings only to convert to the right types later (unless you have questionable data cleanliness).
  • Watch your data types, too! It's unlikely you're dealing with a binary/LOB data type coming from a flat file, but those data types can't be guaranteed to fit in memory, so SSIS carries a pointer in memory to an address on disk. While this doesn't eat all your memory, it does slow down processing, as many tiny files are created and destroyed during the operation.
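To see why the data-type advice matters, compare rows per buffer for a lazily-typed row against an honestly-typed one. A sketch, assuming a 10 MB buffer and illustrative column widths (DT_WSTR uses 2 bytes per character; the column mix is invented for the example):

```python
BUFFER_SIZE = 10 * 1024 * 1024  # SSIS DefaultBufferSize default

# Lazy: 300 columns, all DT_WSTR(50) -> 100 bytes per column.
lazy_row = 300 * 100
# Typed: say 200 four-byte ints, 50 eight-byte dates, 50 short 40-byte strings.
typed_row = 200 * 4 + 50 * 8 + 50 * 40

print(BUFFER_SIZE // lazy_row)   # rows per buffer when everything is a wide string
print(BUFFER_SIZE // typed_row)  # roughly 9x more rows per buffer with honest types
```

Same data, same buffer size; the only difference is declared width, which is exactly the lever the last two bullets are pointing at.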