SSIS – Load Using SQL Command

ssis

I'm trying to run the insert below using SSIS. I'm doing this because some of the IDs in the source already exist in the destination table. I used the following to get the Min and Max IDs for both tables so that I could insert only the rows that don't exist yet:

/*
--Source table
MinID        MaxID
552596452    633166940

--Destination table
MinID        MaxID
450485204    596693972
*/
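
Those ranges came from simple MIN/MAX queries along these lines (assuming id is the key column in both tables):

SELECT MIN(id) AS MinID, MAX(id) AS MaxID FROM sourcetable;
SELECT MIN(id) AS MinID, MAX(id) AS MaxID FROM dbo.destinationtable;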

I then used those ranges to implement the insert in an SSIS package, but I couldn't get the package to run. Is there anything that I'm doing wrong? Is there any other SSIS data flow that I could use to work around this? I chose SSIS because the load is around 30 million rows, which would take too long to run as plain SQL and would also fill up the log space pretty quickly, affecting the performance of the server.

INSERT INTO dbo.destinationtable
           ([id]
           ,[userId]
           ,[serviceId]
           ,[keyId]
           ,[wappush]
           ,[builturl]
           ,[datestamp]
           ,[pocessingTime]
           ,[UserAgent]
           ,[Referer]
           ,[fallBack]
           ,[Error]
           ,[ClientIP]
           ,[AdvertUrl])
SELECT  id, userId, serviceId, keyId, wappush, builturl, datestamp, pocessingTime,
        UserAgent, Referer, fallBack, Error, ClientIP, AdvertUrl
FROM    sourcetable a WITH (NOLOCK)
WHERE   datestamp < DATEADD(d, 0, DATEDIFF(d, 0, GETDATE() - 60))
  AND   id > 596693972 AND id <= 633166940
ORDER BY id

Best Answer

If you're going to use SSIS, may I suggest you use it? ;)

What I believe you are attempting to do is insert some data that doesn't already exist in the target. You have identified what data needs to go using your min/max logic and you attempted to do a similar operation for the target.

For starters, the OLE DB Command is something you should use sparingly. It performs singleton operations and so the volume of rows you want this thing handling should be in the tens, maybe hundreds. Definitely not 80 million.

Your data flow should look something like this

(Screenshot of the data flow: an OLE DB Source feeding a Lookup component, with the Lookup's no-match output going to an OLE DB Destination.)

I have an OLE DB Source (an ADO.NET source can be used, but if you need to parameterize your query it's a little more complex). You can use your existing query, or chunk it up if you need to be gentler on your active systems.
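
If you do chunk it up, the parameterized source query might look roughly like this (the ? markers are OLE DB parameter placeholders that you would map to package variables holding the current chunk's ID boundaries):

-- Source query with the ID range supplied by package variables
SELECT  id, userId, serviceId, keyId, wappush, builturl, datestamp, pocessingTime,
        UserAgent, Referer, fallBack, Error, ClientIP, AdvertUrl
FROM    sourcetable WITH (NOLOCK)
WHERE   datestamp < DATEADD(d, 0, DATEDIFF(d, 0, GETDATE() - 60))
  AND   id > ? AND id <= ?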

The Lookup Component, LKP RM_DimEmplo..., is the key to this operation. When the package starts, it is going to run the source query in that component and cache all the values. That sounds expensive, so don't let it be: don't just select everything from dbo.destinationtable. Write the lookup query to bring back only the keys you need to make a match, which here means those IDs. That result set will be very narrow and shouldn't be so bad.

The idea is that you will have your source rows and you will compare them to the cache of the target table based on those keys. Since you're not worried about detecting differences, anything that doesn't match needs to go, so on the "General" tab change "Specify how to handle rows with no matching entries" to Redirect rows to no match output. Otherwise, this will blow up the first time it doesn't make a match.
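
A minimal sketch of that lookup query, assuming id is the matching key in dbo.destinationtable:

-- Lookup cache query: return only the key column, not the whole table
SELECT  D.id
FROM    dbo.destinationtable AS D;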

If you find it's still impactful on your systems, then you need to go to a more advanced approach of using Expressions on the Lookup Component to filter the cached range of keys to match the IDs coming in. Say you have identified that the source range is 100 to 200; there's no need to pull in the full 0 to 1000 range of keys from the destination, you would simply make the lookup cache the keys that are in the operational range. 80M shouldn't be terribly taxing unless you have just no RAM to spare on the machine.
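
As a sketch, the range-limited lookup query might look like this (the boundary values are the ones from your query; in practice you'd build this string with an expression on the Lookup's SqlCommand property rather than hard-coding it):

-- Lookup cache limited to the operational key range
SELECT  D.id
FROM    dbo.destinationtable AS D
WHERE   D.id > 596693972
  AND   D.id <= 633166940;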

The OLE DB Command does singleton, one-off operations. The OLE DB Destination can either do singleton inserts (the 2005 default, Table or view) or set/batch based inserts (Table or view - fast load), which can be screaming-cheetah-wheelies fast. Since you're loading a lot of data, unless you want your admins to hate you, change the Maximum insert commit size to something other than the default of 2147483647. That default basically says don't commit anything until it's all been inserted, which can cause your transaction log to get big in a hurry.
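
To make that concrete, here is roughly what I'd set on the destination (the commit size value is only an illustration; tune it to what your transaction log can tolerate):

Data access mode:            Table or view - fast load
Maximum insert commit size:  1000000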

The nice thing about your problem is that you can run this to your heart's content. Say 1M rows get loaded and something bad happens: just restart. Whatever got committed stays committed, and when the package restarts it might pull the same source data, but your lookup will have the most recent set of keys from the target, so the result will be no operation performed for those rows. I know, you'd think everyone loves duplicates in their target system, but for some reason that just doesn't fly at some places...

Wrapup

I can't recommend the Stairway to Integration Services series enough. Specifically for this problem, see Level 3.