SQL Server – Is this a good strategy for importing a large amount of data and decomposing as an ETL

Tags: etl, sql-server, sql-server-2008-r2

I have a set of five tables (a highly decomposed schema for an ETL, if I understand the nomenclature) that I'm going to load via bulk import, then run some inserts from those five tables into a SLEW of other tables, including inserts that rely only on the values I just inserted into the first tables.

I can do my inserts as an A, B, C process: insert into the first table, then insert into some table S where the row exists in A + T (T being some other table that has preloaded "configuration data"), then insert into Z where the row exists in B + U, and so on, roughly like the sketch below.
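To make that concrete, each step would be a single set-based statement, something like this sketch (A, S, T and the column names are placeholders for my real schema):

-- Sketch only: A, S, T and their columns stand in for my real tables
INSERT INTO S (a_id, config_value)
SELECT a.id, t.config_value
FROM A AS a
JOIN T AS t
    ON t.lookup_key = a.lookup_key;   -- "where exists in A + T"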

Should I be trying to batch those inserts with a cursor (I know, stone the traitor) or should I just run the raw INSERT INTO scripts and let the server die a thousand heat deaths? Basically I'm worried about starving the server or causing it to collapse from too many index updates or something else related to the inserts.

Should I stage out the inserts as:

  1. Insert into one set of tables
  2. Insert into secondary tables based on the first tables
  3. Insert into tertiary tables, etc.

OR should I insert into all the tables where the data is needed, but do it via cursor in a "for loop" style pattern of 100k rows at a time (sketched below)?
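By the "for loop" pattern I mean something shaped like this (a WHILE loop standing in for the cursor-style batching; same placeholder names as above):

-- Batching sketch only: walk the source in 100k-row ID ranges
DECLARE @minId INT, @maxId INT, @batch INT = 100000;
SELECT @minId = MIN(id), @maxId = MAX(id) FROM A;

WHILE @minId <= @maxId
BEGIN
    INSERT INTO S (a_id, config_value)
    SELECT a.id, t.config_value
    FROM A AS a
    JOIN T AS t ON t.lookup_key = a.lookup_key
    WHERE a.id BETWEEN @minId AND @minId + @batch - 1;

    SET @minId = @minId + @batch;
END;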

FWIW, this is a behavior I saw from the DBAs at my last job, so I figure that's "what I'm supposed to do" (the batch process via cursors) but maybe I don't understand enough about what they were doing (they were also live-loading into systems that already had data, and were loading new data afterwards).

Also bear in mind that I'm normally a C# dev, but I've got the most TSQL experience here and I'm trying to make the best process I can for raw-loading this data as opposed to our "current" method that is mostly webservice fetches and NHibernate save-commits.

Things I think are important to the question:

  1. There will be no other load on this server when I do this, I will have complete and total control of the box, and I'm the only user interested here (this is the initial data load before anyone else can do anything with the box)
  2. There are indexes on all the tables, where appropriate, and there are FKs and other predicates
  3. The entire DB system is in use for slews of other clients right now, so I know the indexes are needed for operation, but if I should disable them and re-enable them to make the inserts etc. faster, I'm looking for those sorts of optimization hacks (something like the index sketch after this list).
  4. I'm waiting on sample data, but I expect the largest table to have about a million fairly wide rows, and I'll insert at most a million rows into each of the ancillary tables that come secondary or tertiary.
  5. The hardware is "average".
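For point 3, by "disable those and re-enable them" I mean something along these lines, applied to the nonclustered indexes only (the index and table names are made up):

-- Placeholder names; disable a nonclustered index before the load, rebuild it after
ALTER INDEX IX_SomeTable_SomeColumn ON dbo.SomeTable DISABLE;

-- ...bulk import and staged inserts go here...

ALTER INDEX IX_SomeTable_SomeColumn ON dbo.SomeTable REBUILD;
-- (disabling the clustered index would make the table inaccessible, so leave that one alone)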

Best Answer

If you are confident in the integrity of the data being imported, it may be a good idea to disable all the constraints in your database before beginning your inserts and then re-enable them after the fact.

See this helpful Stack Overflow answer from a while back: Can foreign key constraints be temporarily disabled using T-SQL?

This will save you the headache of having to worry about layering the inserts in order to respect the existing constraints of the database you are loading into.
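A minimal per-table sketch of that disable/re-enable pattern (dbo.SomeTable is a placeholder; the linked answer covers applying it across every table):

-- Disable FK and check constraints on one table before the load
ALTER TABLE dbo.SomeTable NOCHECK CONSTRAINT ALL;

-- ...bulk import / inserts...

-- Re-enable and re-validate them afterwards
ALTER TABLE dbo.SomeTable WITH CHECK CHECK CONSTRAINT ALL;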

In terms of the actual inserts themselves, I'd be on the side of not using cursors. Not only is the process slow, but cursors take up a large amount of memory and create DB locks. If you are cursoring through a very large number of rows you also run the risk of very quickly growing the database log. If the server is only an average one, then space may eventually be a concern. Try to take a more set-based approach when doing the additional inserts needed for your process.

For example, do this if you can:

INSERT INTO t1 (col1)
SELECT col1 FROM t2;

instead of this:

-- ...
INSERT INTO t1 (col1) VALUES ('foo');
INSERT INTO t1 (col1) VALUES ('bar');
-- ...and so on, one INSERT per row
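If log growth from one huge INSERT...SELECT becomes a worry, the same set-based statement can be run in chunks rather than falling back to a cursor. A rough sketch reusing the t1/t2 example, assuming col1 is enough to tell which rows have already been copied:

DECLARE @batch INT = 100000;

WHILE 1 = 1
BEGIN
    INSERT INTO t1 (col1)
    SELECT TOP (@batch) t2.col1
    FROM t2
    WHERE NOT EXISTS (SELECT 1 FROM t1 WHERE t1.col1 = t2.col1);

    IF @@ROWCOUNT < @batch BREAK;   -- stop once the last partial batch is done
END;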