Sql-server – When importing raw files for internal conversion, should I use a secondary database or just isolate them within the database

Tags: best-practices, csv, sql-server-2008-r2

I've got between 1.25 and 2 GB of CSV files to import (and already have most of the process running smooth as butter), so my question is: does it make sense as a "best practice" to use a "secondary" database for the import, or should I just load them into the database that I'm going to be working in?

Example:

FinalDatabase -- this is the one I'm wanting to do, unless there's a net negative
  core.Users
  core.Addresses
  core.Emails
  core.OtherTables
  staging.UsersImport
  staging.ContactImport

FinalDatabase
  core.Users
  core.Addresses
  core.Emails
  core.OtherTables
StagingDB
  dbo.UsersImport
  dbo.ContactImport

Obviously I'll be migrating from one table to the others via scripts, so it doesn't make much difference whether the source is referenced by a two-part or a three-part name, i.e. whether it's just

INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM staging.UsersImport

or

INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM StagingDB.dbo.UsersImport

The cons I've identified with the first style are the following:

Increases backup size
Increases file size (including increasing the file size of the existing database — autogrowth issues)
Negates the potential use of multiple spindles (in my case not an option 🙁 )

Some of the pros I've identified with the first style:

The original import data stays with the database in case of future questions (until someone truncates it, of course)
It gets backed up along with the rest of the data in case you need to recover something

What would be considered a best practice in this situation and why?

Best Answer

Well, since you have the CSV files, they can serve as a backup to the staging tables, so I see no benefit to keeping them in the primary database (or maintaining the staging tables after the import is done). If those staging tables can be discarded or truncated once the import is finished, then they really should be separate and have as little impact on the primary database as possible.

Also, I see no value in putting the staging tables in the primary database in any case, no matter how permanent the loaded data will be. If you need the data to be permanent, you can always reference it even if it lives in a different database. In fact that is arguably better, because now you can handle that data independently:

- set the recovery model of that database to simple
- mark it as read only
- heck, back it up and drop it, knowing that you can always recover it later
- restrict access to it in simpler and higher-level ways than you could for tables in the primary database (to prevent the "until someone truncates it" fear)

Or just abandon it because, as I said above, your CSV files are already acting as a secondary backup.
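
As a rough sketch of what those options could look like in T-SQL, assuming the separate database is called StagingDB as in the question (the backup path is purely hypothetical, and the exact options you want may differ):

```sql
-- Assumption: the staging database is named StagingDB, as in the question.
-- The raw data can be reloaded from the CSV files, so point-in-time
-- recovery is not needed here; SIMPLE keeps the log small.
ALTER DATABASE StagingDB SET RECOVERY SIMPLE;

-- Once the import has been verified, freeze the raw data.
-- WITH ROLLBACK IMMEDIATE kicks out other sessions; leave it off
-- if you would rather wait for them to disconnect.
ALTER DATABASE StagingDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- Optionally archive it. The path below is a placeholder.
BACKUP DATABASE StagingDB
    TO DISK = N'D:\Backups\StagingDB.bak'
    WITH INIT, CHECKSUM;

-- Only after the backup has been checked (e.g. RESTORE VERIFYONLY):
-- DROP DATABASE StagingDB;
```

A read-only database also protects against the accidental truncate mentioned in the question, since no writes are possible until someone deliberately switches it back to READ_WRITE.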

In addition to the cons you've already identified, well, they're essentially just scratch tables. Even if you're only on a single spindle, why bother disrupting the primary database? There's no penalty in crossing database boundaries on the same instance, e.g. INSERT ... FROM db.dbo.table; compared to INSERT ... FROM dbo.table;.
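
To make the cross-database form concrete, here is a minimal sketch using the table names from the question. The column names (FirstName, LastName) are invented for illustration, since the question only says "fieldlist":

```sql
-- Run from any database on the same instance; three-part names
-- identify both the target and the source explicitly.
-- FirstName and LastName are hypothetical columns.
INSERT INTO FinalDatabase.core.Users (FirstName, LastName)
SELECT s.FirstName, s.LastName
FROM StagingDB.dbo.UsersImport AS s;
```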

Finally, when you say "autogrowth" I cringe. If you know your database is going to get bigger than it is now, why not manually expand it to the size you need it to be? Autogrow is a performance nightmare when it happens during peak activity. You will certainly notice it even on a single machine, and even as the only one performing any work: basically all of your data transfer will halt whenever the data or log file needs to grow. This is particularly true if you've left the autogrow settings at the current, ridiculous defaults (on SQL Server 2008 R2, 1 MB for data files and 10% for log files), and/or if you don't have instant file initialization (IFI) enabled. Don't forget to set your log growth settings appropriately too, as log files cannot take advantage of IFI.
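
Pre-sizing the files might look something like the following. This is only a sketch: the logical file names (FinalDatabase_data, FinalDatabase_log) and all of the sizes are assumptions, so check sys.database_files for the real names and pick numbers that fit your data.

```sql
USE FinalDatabase;

-- Look up the real logical file names, current sizes, and growth settings.
-- size is reported in 8 KB pages.
SELECT name, type_desc, size * 8 / 1024 AS size_mb, growth, is_percent_growth
FROM sys.database_files;

-- Grow the data file once, up front, instead of in many small autogrow steps.
-- The new SIZE must be larger than the current size.
ALTER DATABASE FinalDatabase
    MODIFY FILE (NAME = N'FinalDatabase_data', SIZE = 8GB);

-- Use a fixed growth increment as a safety net rather than a tiny or percentage default.
ALTER DATABASE FinalDatabase
    MODIFY FILE (NAME = N'FinalDatabase_data', FILEGROWTH = 512MB);

-- Same treatment for the log, which cannot use instant file initialization.
ALTER DATABASE FinalDatabase
    MODIFY FILE (NAME = N'FinalDatabase_log', SIZE = 2GB);

ALTER DATABASE FinalDatabase
    MODIFY FILE (NAME = N'FinalDatabase_log', FILEGROWTH = 256MB);
```

Each property is changed in its own statement here to stay on the conservative side of the documented MODIFY FILE behavior.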

Related Solutions

Sql-server – Increased RAM, Worse Performance

Thanks to everyone for your help.

After poring over some execution plans, it turns out there is a JOIN that is being processed differently based on the amount of RAM available. With less RAM, the optimizer evaluates it with a Hash Join; with more RAM, it uses a series of Merge Joins.

So basically it came down to poorly written T-SQL, which I am presently refactoring.

Sql-server – Is this a good strategy for importing a large amount of data and decomposing as an ETL

If you are confident in the integrity of the data being imported, it may be a good idea to disable all the constraints on your database before beginning your inserts and then re-enable them after the fact.

See this helpful Stack Overflow answer from a while back: Can foreign key constraints be temporarily disabled using T-SQL?

This will save you the headache of having to layer the inserts in an order that respects the existing constraints of the database you are loading into.
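
A minimal sketch of that disable/re-enable pattern for a single table follows. The table name dbo.ChildTable is hypothetical; you would repeat this for each table being loaded.

```sql
-- dbo.ChildTable is a placeholder name.
-- NOCHECK CONSTRAINT ALL disables the table's FOREIGN KEY and CHECK
-- constraints; PRIMARY KEY, UNIQUE, and DEFAULT constraints stay active.
ALTER TABLE dbo.ChildTable NOCHECK CONSTRAINT ALL;

-- ... run the inserts here, in whatever order is convenient ...

-- Re-enable the constraints. The leading WITH CHECK makes SQL Server
-- validate the rows that were loaded, so the constraints are trusted again.
ALTER TABLE dbo.ChildTable WITH CHECK CHECK CONSTRAINT ALL;

-- Sanity check: any foreign keys still marked as not trusted?
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID(N'dbo.ChildTable');
```

If the re-enable step fails, the imported data violates a constraint, which is exactly the case the "if you are confident in the integrity of the data" caveat is about.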

In terms of the actual inserts themselves, I'd be on the side of not using cursors. Not only is the process slow, but cursors take up a large amount of memory and create database locks. If you are cursoring through a very large number of rows, you also run the risk of very quickly escalating the size of the database logs. If the server is only an average one, then space may eventually become a concern. Try to consider a more set-based approach when doing the additional inserts needed for your process.

For example, if you can do this:

insert into t1 (col1)
SELECT col1 FROM t2

instead of this:

...
insert into t1 (col1) values ('foo');
insert into t1 (col1) values ('bar');
insert into t1 (col1) values (...);
...
