Is it faster to truncate destination and copy all source data or lookup and keep only non-matching data with SSIS transfer packages

azure-sql-database, etl, ssis

I am building an Azure SQL DB database and to do this I will have to regularly (weekly-monthly) pull data from various other servers. I am designing SSIS packages to do this and whilst most of the data import tasks run within an hour, a few take much longer (one has been running all day and hasn't finished yet) due to the much larger number of rows in those tables. Ideally, the transfers would run as quickly as possible to minimise the cost of running our Azure SSIS IR.

I have so far been transferring the data by truncating each destination table (to prevent duplication) and then copying over the entire contents of the source table (as I just want to replicate it).

In general, assuming the majority of rows in each table are not new this week/month, would it be faster to transfer data this way, or would it be preferable to use a lookup task and copy only the non-matching (i.e. new) rows from the source database to the destination, rather than truncating the table and refilling it completely each time? (Please correct me if there's a more appropriate way of doing this, or if I've misunderstood how lookup works!)

I can't currently test which is faster, as the initial transfer is still running and the tables aren't fully populated yet. I'm also new to SSIS, so I would really appreciate advice from people more familiar with it. For context, I only have read-only access to the source servers, and not all source tables have unique row IDs, so matching on row ID alone or changing the source tables are not strategies I can use.

Best Answer

Is it faster...

It depends on a great many things, so there is no single correct answer:

  • What proportion of the target rows would you need to delete, update, or insert anew if you don't use the truncate+replace method?
  • The performance metrics of the source and target systems
  • The network performance between where your SSIS packages are running and the source and target resources

If you are replacing or updating most of the rows, then truncate and recreate will probably be faster. If you only need to affect relatively few rows, then partial updates are likely to be faster, though this is not always the case.
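As a rough sketch of what the partial-update route can look like on the Azure SQL side (table, schema, and column names here are all hypothetical): land the extract in a staging table with the SSIS data flow, then apply it to the real table in one set-based statement. Note this relies on a dependable matching key, which the question notes not every source table has.

    -- Hypothetical example: staging.SalesOrders holds this run's extract.
    MERGE dbo.SalesOrders AS tgt
    USING staging.SalesOrders AS src
        ON tgt.OrderID = src.OrderID            -- needs a reliable key
    WHEN MATCHED AND src.ModifiedDate > tgt.ModifiedDate THEN
        UPDATE SET tgt.Amount       = src.Amount,
                   tgt.ModifiedDate = src.ModifiedDate
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OrderID, Amount, ModifiedDate)
        VALUES (src.OrderID, src.Amount, src.ModifiedDate)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;                                 -- only if deletes must propagate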

If truncate+replace does turn out to be faster (or only slightly slower), you may find that dropping all indexes apart from the clustered index and recreating them after the copy improves matters further.
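For example (index and table names hypothetical; non-clustered indexes have to be disabled one by one, while a single REBUILD brings everything back afterwards):

    -- Before the bulk copy: disable each non-clustered index on the target.
    ALTER INDEX IX_SalesOrders_CustomerID ON dbo.SalesOrders DISABLE;
    ALTER INDEX IX_SalesOrders_OrderDate  ON dbo.SalesOrders DISABLE;

    -- After the copy has finished: rebuild (and so re-enable) them all.
    ALTER INDEX ALL ON dbo.SalesOrders REBUILD;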

assuming the majority of rows in each table are not new this week/month

In this instance, if you know exactly what timestamp you need data from (this can usually be derived from the target's current data, or you could store the relevant value for future reference after each run) and you can efficiently ask the sources for only the updated data, then this will definitely be faster. Even if you only know roughly and have to request data from a bit further back to make sure you get everything, you will not only be writing less at the destination but will also be transferring less from each source (just the recent rows) and reading less from the destination if you need to merge (again, just the recent rows) than you would by pulling all the data from the sources.
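A minimal sketch of that pattern, assuming the source rows carry a usable ModifiedDate column (all names hypothetical):

    -- On the target: derive the high-water mark from the data already loaded.
    DECLARE @LastLoaded datetime2 =
        (SELECT MAX(ModifiedDate) FROM dbo.SalesOrders);

    -- On the source (e.g. as a parameterised SSIS source query): pull only
    -- rows changed since then, padded back a day in case the timestamps
    -- can't be fully trusted.
    SELECT OrderID, Amount, ModifiedDate
    FROM   dbo.SalesOrders
    WHERE  ModifiedDate > DATEADD(DAY, -1, @LastLoaded);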

I can't currently test which is faster as the initial transfer is still going

You can do some useful analysis at this point though: what seems to be the current bottleneck? It could be:

  • waiting for data to be read from the source(s) - i.e. their IO subsystem or DB design is limiting the process's speed; in this case you'll see high IO at that end from your process and less activity elsewhere
  • waiting for data to be transferred from the source(s) - i.e. bandwidth is a limiting factor, is that link saturated while everything else is more idle?
  • processing by SSIS if your ETL is more complicated than "read from here, write to there" - i.e. are you seeing a lot of CPU and/or IO activity from SSIS?
  • waiting for data to be transferred to the target - as above but in reverse
  • waiting for data to be written at the target - as above but in reverse
  • ... or of course some combination of the above

Keep watching as the tasks proceed: does it look like the same key bottleneck(s) most of the way through the process?
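On the database ends, one quick way to see what the transfer is waiting on at any given moment is to snapshot the active requests; this DMV query works on both SQL Server and Azure SQL DB:

    -- wait_type hints at the bottleneck: e.g. PAGEIOLATCH_* points at
    -- storage IO, ASYNC_NETWORK_IO at the network/consumer side.
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.wait_time,
           r.cpu_time,
           r.logical_reads
    FROM   sys.dm_exec_requests AS r
    WHERE  r.session_id <> @@SPID;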