I am building an Azure SQL Database and will need to regularly (weekly to monthly) pull data from various other servers. I am designing SSIS packages to do this, and whilst most of the data import tasks run within an hour, a few take much longer (one has been running all day and has not finished yet) due to the much larger number of rows in those tables. Ideally the transfers would run as quickly as possible to minimise the cost of running our Azure SSIS IR.
I have so far been transferring the data by truncating each destination table (to prevent duplication) and then copying over the entire contents of the source table (as I just want to replicate it).
Assuming the majority of rows in each table are not new each week/month, would it generally be faster to keep transferring data this way, or would it be preferable to use a Lookup transformation and copy only non-matching (i.e. new) rows from the source database to the destination, rather than truncating the table and refilling it completely each time? (Please correct me if there's a more appropriate way of doing this or I've misunderstood how Lookup works!)
I can't currently test which is faster because the initial transfer is still running, so the tables aren't fully populated yet; I'm also new to SSIS, so I would really appreciate advice from people more familiar with it. For context, I only have read-only access to the source servers, and not all source tables have unique row IDs, so matching on row ID alone or changing the source tables are not strategies I can use.
Best Answer
It depends on a great many things, so there is no single correct answer:
If you are replacing or updating most of the rows, then truncate-and-reload will probably be faster. If you only need to affect relatively few rows, then partial updates are likely to be faster, though this is not always the case.
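For the partial-update route, the usual pattern on the destination side is to land the incoming rows in a staging table and then apply them with a `MERGE`. A minimal sketch, with entirely hypothetical table and column names, and assuming the table has a reliable business key (which, as you note, not all of your source tables do — this approach only applies where one exists):

```sql
-- Hypothetical schema: staging table dbo.Stage_Orders holds the freshly
-- copied rows; dbo.Orders is the destination table being maintained.
-- OrderID is an assumed business key; Amount/Status are assumed columns.
MERGE dbo.Orders AS tgt
USING dbo.Stage_Orders AS src
    ON tgt.OrderID = src.OrderID
WHEN MATCHED AND (tgt.Amount <> src.Amount OR tgt.Status <> src.Status) THEN
    UPDATE SET Amount = src.Amount,
               Status = src.Status
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, Amount, Status)
    VALUES (src.OrderID, src.Amount, src.Status);
```

The `WHEN MATCHED AND (...)` filter avoids rewriting rows that haven't actually changed, which keeps the write volume close to the true number of changed rows.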
If truncate-and-reload does turn out to be faster (or only slightly slower), you may find that dropping all indexes (apart from the clustered index) and recreating them after the copy improves matters further.
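Rather than dropping and recreating indexes, you can disable and rebuild them, which keeps the index definitions in place. A sketch with an illustrative index name:

```sql
-- IX_Orders_CustomerID is a hypothetical nonclustered index on the
-- destination table. Disable it before the bulk load so inserts don't
-- have to maintain it row by row...
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders DISABLE;

-- ...run the SSIS data flow here...

-- ...then rebuild it once, over the finished data.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD;
```

Note that `ALTER INDEX ALL ... DISABLE` would also disable the clustered index, which makes the table inaccessible until it is rebuilt, so disable the nonclustered indexes individually.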
In this instance, if you know exactly what timestamp you need data from (this can usually be derived from the destination's current data, or you could store the relevant value for future reference after each run) and you can efficiently ask the sources for only the updated rows, then this will definitely be faster. Even if you only know roughly and have to request data from a bit further back to make sure you get everything, you will not only be writing less at the destination, but also transferring less from each source (just the recent rows) and reading less from the destination if you need to merge (again, just the recent rows), than you would pulling all the data from the sources.
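The "store the relevant value for future reference" idea is commonly implemented as a small watermark control table on the destination. A sketch, assuming a hypothetical `ModifiedDate` column on the source table (your mileage will vary per source):

```sql
-- Control table on the destination: remembers how far each source
-- table has been copied. Names here are illustrative.
CREATE TABLE dbo.ETL_Watermark
(
    SourceTable sysname      NOT NULL PRIMARY KEY,
    LastLoaded  datetime2(3) NOT NULL
);

-- Query the SSIS source component would issue against the source server:
-- only rows changed since the stored watermark.
DECLARE @LastLoaded datetime2(3) =
    (SELECT LastLoaded FROM dbo.ETL_Watermark
     WHERE SourceTable = N'dbo.Orders');

SELECT *
FROM dbo.Orders
WHERE ModifiedDate > @LastLoaded;

-- After a successful load, advance the watermark. Using the maximum
-- ModifiedDate actually seen (rather than the current time) avoids
-- missing rows due to clock skew between servers.
UPDATE dbo.ETL_Watermark
SET LastLoaded = (SELECT MAX(ModifiedDate) FROM dbo.Stage_Orders)
WHERE SourceTable = N'dbo.Orders';
```

This only needs read access on the source side; the watermark bookkeeping all lives on the destination, which fits your read-only constraint.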
You can do some useful analysis at this point, though: what seems to be the current bottleneck? It could be CPU or disk I/O on the source, the network transfer in between, or write throughput (including index maintenance) at the destination.
Keep watching as the tasks proceed: does it look like the same key bottleneck(s) most of the way through the process?
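One concrete way to watch this on the destination (Azure SQL Database exposes the standard DMVs, assuming your login has `VIEW DATABASE STATE`) is to check what the running load is actually waiting on:

```sql
-- Snapshot of active requests on the destination: if wait_type is
-- consistently a network wait (e.g. ASYNC_NETWORK_IO) the transfer is
-- the bottleneck; persistent log/page waits point at the writes instead.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time,
       r.cpu_time,
       r.reads,
       r.writes
FROM sys.dm_exec_requests AS r
WHERE r.session_id <> @@SPID;   -- exclude this monitoring query itself
```

Run it a few times over the course of a load: a bottleneck that shows up once can be noise, but one that dominates every snapshot is worth acting on.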