SQL Server – T-SQL vs SSIS Lookup

data-warehouse, sql-server, ssis, t-sql

I have been trying to find well-performing T-SQL equivalents for the SSIS Lookup component, especially for fact table loading.

After trying a few different query patterns, the most obvious one performed best:

    INSERT INTO Fact WITH (TABLOCK)
    SELECT
        F.Attribute1,
        ...
        d1.DimAId,
        d2.DimBId,
        d3.DimCId,
        ...
        F.Amount
        ...
    FROM Staging.dbo.GeneralLedger F WITH (NOLOCK)
    LEFT OUTER JOIN DimA d1 WITH (NOLOCK)
        ON d1.AId = F.AId
    LEFT OUTER JOIN DimB d2 WITH (NOLOCK)
        ON d2.BId = F.BId
    LEFT OUTER JOIN DimC d3 WITH (NOLOCK)
        ON d3.CId = F.CId

Performance-wise, I tested this with 5 million rows and lookups against 9 dimensions.

SSIS: 1m14s
T-SQL: 1m0s

What would the implications be if I chose T-SQL over SSIS for larger datasets (100M+ rows)?
I tested this on a server with 200 GB of RAM, so there was no problem fitting everything in memory.

However, I suppose that if this query were run on a machine with less RAM, it would start spilling to disk and performance would suffer badly. SSIS would then be the more memory-efficient option. Am I correct in this assumption?

Another idea that comes to mind is using a cursor loop in T-SQL to process, for example, 1M rows at a time until all the data is loaded. That way there is less chance of running out of memory and spilling to disk. Is this a viable approach, or do you see any other approaches?

PS: I am fully aware that SSIS is the way to go for fact table loading. However, let us assume we only want T-SQL.
PS2: This question was also posted on Stack Overflow: https://stackoverflow.com/questions/26638590/tsql-vs-ssis-lookup. However, there was no real groundbreaking response there.

Best Answer

Batching large data sets into smaller ones, e.g. 100,000 or 1,000,000 rows at a time, will make the load run better than one big insert. But the same is true of SSIS, of course, as it batches its inserts.

The difference in time that you show in your example is fairly small. That does not give much hope for great speed improvements, but it does show that you can use other methods for the import.

If you are willing to extract to a .csv file first, you could use BCP to export from a view to the .csv file, and then use BULK INSERT to load that file into your database:

BULK INSERT described: http://msdn.microsoft.com/en-us/library/ms188365.aspx
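
As a rough sketch of the export step (the view name, output path, and server name below are placeholders, not anything from the question; -c writes character data, -t sets the field delimiter, and -T uses Windows authentication):

    bcp Staging.dbo.vGeneralLedgerExport out C:\temp\GeneralLedger.csv -c -t"," -S MyServer -T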

If you want to do the import entirely in T-SQL, then you might use the BULK INSERT command. When you look at its parameters, you will see that it includes KILOBYTES_PER_BATCH and ROWS_PER_BATCH as ways of controlling batch size.
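
A minimal BULK INSERT sketch along those lines, assuming the .csv produced above (the path, terminators, and row figure are assumptions; note that ROWS_PER_BATCH and KILOBYTES_PER_BATCH are hints the server uses to size the load, while the separate BATCHSIZE option is what actually splits the load into transactions):

    BULK INSERT dbo.Fact
    FROM 'C:\temp\GeneralLedger.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        ROWS_PER_BATCH = 1000000,  -- hint: approximate number of rows in the file
        TABLOCK                    -- table lock, which helps enable a fast bulk load
    );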

Because of the speed of BCP and BULK INSERT, along with the control over batch size, I believe this would be one of the quicker methods to use. But it does require the intermediate .csv file.

Other issues: is your data coming from the same server, or from another server? If the same server, then access is fairly easy. If it is another server and you do not want the intermediate .csv file, you can create a linked server to query the data from the other server.
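
For the linked-server route, a minimal sketch, assuming a linked server named SRCSRV has already been configured to point at the source instance (the name is a placeholder):

    INSERT INTO dbo.Fact WITH (TABLOCK)
    SELECT F.Attribute1, F.Amount  -- plus the dimension lookups from the question
    FROM SRCSRV.Staging.dbo.GeneralLedger AS F;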

However, as you noted, this means that you have to manage the batches yourself. This is similar to what you described as a 'cursor loop', although it probably does not need a cursor, just a loop selecting the next 'n' rows that you want to copy; see the sketch below. If the data is on another server, the overhead of this approach will be higher.
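
Here is a minimal sketch of such a loop, assuming the staging table has an ever-increasing integer key (GLId is an assumed column name; the 1M batch size is the figure from the question; only one dimension join is shown for brevity):

    DECLARE @LastId bigint = 0,
            @MaxId  bigint,
            @Batch  int = 1000000;

    SELECT @MaxId = MAX(GLId) FROM Staging.dbo.GeneralLedger;

    WHILE @LastId < @MaxId
    BEGIN
        -- Copy one key range per iteration; each INSERT commits on its own,
        -- which keeps the memory grant and log usage per batch bounded.
        INSERT INTO dbo.Fact
        SELECT F.Attribute1, d1.DimAId, F.Amount
        FROM Staging.dbo.GeneralLedger AS F
        LEFT OUTER JOIN DimA AS d1
            ON d1.AId = F.AId
        WHERE F.GLId >  @LastId
          AND F.GLId <= @LastId + @Batch;

        SET @LastId = @LastId + @Batch;
    END;

If the key has gaps, some iterations will copy fewer than 1M rows, but the loop still terminates once the watermark passes MAX(GLId).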