I have a database where I load files into a staging table. From this staging table I have 1-2 joins to resolve some foreign keys, and then I insert these rows into the final table (which has one partition per month). I have around 3.4 billion rows for three months of data.
What is the fastest way to get these rows from staging into the final table? An SSIS Data Flow Task (that uses a view as the source and has fast load active), or an INSERT INTO ... SELECT command? I tried the Data Flow Task and can get around 1 billion rows in around 5 hours (8 cores / 192 GB RAM on the server), which feels very slow to me.
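By the INSERT INTO ... SELECT option I mean something along these lines (table and column names here are only placeholders for the real schema):

-- Placeholder schema: resolve foreign keys via joins, then insert into the partitioned final table
INSERT INTO dbo.FinalTable (EventDate, CustomerId, Amount)
SELECT s.EventDate, c.CustomerId, s.Amount
FROM dbo.Staging AS s
JOIN dbo.Customer AS c
  ON c.CustomerCode = s.CustomerCode;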
Best Answer
One common approach:
INSERT dbo.[Target] WITH (TABLOCKX) SELECT ...
You can do the above in batches of n rows, which can reduce the strain on the transaction log, and of course means that if some batch fails, you only have to start from that batch. I blogged about this (while it is in reference to deletes, the same basic concepts apply) here: http://www.sqlperformance.com/2013/03/io-subsystem/chunk-deletes
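A minimal sketch of that batching pattern, assuming the staging table has a contiguous integer key to range over (all object and column names here are illustrative, not taken from the question):

-- Illustrative: walk the staging table in fixed-size key ranges, one transaction per batch
DECLARE @BatchSize bigint = 5000000, @MinId bigint, @MaxId bigint;
SELECT @MinId = MIN(StagingId), @MaxId = MAX(StagingId) FROM dbo.Staging;

WHILE @MinId <= @MaxId
BEGIN
    INSERT dbo.[Target] WITH (TABLOCKX) (EventDate, CustomerId, Amount)
    SELECT s.EventDate, c.CustomerId, s.Amount
    FROM dbo.Staging AS s
    JOIN dbo.Customer AS c ON c.CustomerCode = s.CustomerCode  -- the foreign-key lookups
    WHERE s.StagingId >= @MinId
      AND s.StagingId <  @MinId + @BatchSize;

    SET @MinId += @BatchSize;  -- a failure only costs the current batch
END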
If your partitions are physical and not just logical, you may gain some time by having different processes populate different partitions simultaneously (of course this means you can't use TABLOCK/TABLOCKX). This assumes that the source is also suitable for multiple processes selecting without overlapping / locking etc., and without making that side of the operation even slower (hint: create a clustered index on the source that suits the partitioning scheme on the destination).
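As a sketch of that parallel idea (again with made-up names, and assuming the destination is partitioned by an EventDate column): cluster the source on the partitioning column, then run the same insert in several sessions, each restricted to a different month, so each session writes into a different partition:

-- Hypothetical: align the source with the destination's partitioning column
CREATE CLUSTERED INDEX CIX_Staging_EventDate ON dbo.Staging (EventDate);

-- Session 1 (other sessions run the same statement with their own month boundaries)
INSERT dbo.[Target] (EventDate, CustomerId, Amount)  -- no TABLOCK/TABLOCKX, so the sessions don't block each other
SELECT s.EventDate, c.CustomerId, s.Amount
FROM dbo.Staging AS s
JOIN dbo.Customer AS c ON c.CustomerCode = s.CustomerCode
WHERE s.EventDate >= '20240101'   -- substitute the actual month boundaries
  AND s.EventDate <  '20240201';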
You may also consider something a lot more primitive, like BCP OUT / BCP IN.
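A rough sketch of that route, assuming Windows authentication, native format, and placeholder server/database/file names:

rem Export the joined rows to a native-format file
bcp "SELECT s.EventDate, c.CustomerId, s.Amount FROM MyDb.dbo.Staging AS s JOIN MyDb.dbo.Customer AS c ON c.CustomerCode = s.CustomerCode" queryout staging.dat -S MyServer -T -n

rem Bulk load the file in 1,000,000-row batches with a table lock
bcp MyDb.dbo.FinalTable in staging.dat -S MyServer -T -n -b 1000000 -h "TABLOCK"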
I don't know that I would jump to SSIS to help with this. There are probably some efficiencies there, but I don't know that the effort justifies the savings.