Sql-server – When importing raw files for internal conversion, should I use a secondary database or just isolate them within the database

best practicescsvsql-server-2008-r2

When importing raw files for internal conversion, should I use a secondary database or just isolate them within the database?

I've got between 1.25 and 2GB of CSV files to be imported (and already have most of the process running smooth as butter) so my question is: does it make sense as a "best practice" to use a "secondary" database for the import or just load them into the database that I'm going to be working in?

Example:

FinalDatabase -- this is the one I'm wanting to do, unless there's a net negative
  core.Users
  core.Addresses
  core.Emails
  core.OtherTables
  staging.UsersImport
  staging.ContactImport

OR

FinalDatabase
  core.Users
  core.Addresses
  core.Emails
  core.OtherTables
StagingDB
  dbo.UsersImport
  dbo.ContactImport

Obviously I'll be migrating from one table to the others via scripts, so it doesn't make much difference what the four-part name is going to be, whether it's just

INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM staging.Users

or

INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM StagingDB.dbo.Users

The cons I've identified with the first style are the following:

  • Increases backup size
  • Increases file size (including increasing the file size of the existing database — autogrowth issues)
  • Negates the potential use of multiple spindles (in my case not an option 🙁 )

Some of the pros I've identified with the first style:

  • The original import data stays with the database in case of future questions (until someone truncates it, of course)
  • It gets backed up with the latter data in case you need to recover something

What would be considered a best practice in this situation and why?

Best Answer

Well, since you have the CSV files, they can serve as a backup to the staging tables, so I see no benefit to keeping them in the primary database (or maintaining the staging tables after the import is done). If those staging tables can be discarded / truncated once the import is finished, then they really should be separate and have as minimal impact on the primary database as possible.

Also, I see no value in putting the staging tables in the primary database in any case, no matter how permanent the loaded data will be. And if you need the data to be permanent, you can always reference the data even if it is in a different database - in fact that is arguably better, because now:

  • you can handle that data alone independently:

    • set the recovery model of that database to simple
    • mark it as read only
    • heck back it up and drop it, knowing that you can always recover it later
    • restrict access to it in simpler and higher level ways than you could tables in the primary database (to prevent the "until someone truncates it" fear)
  • or just abandon it because, as I said above, your CSV files are already acting as a secondary backup.

In addition to the cons you've already identified, well, they're essentially just scratch tables. Even if you're only on a single spindle, why bother disrupting the primary database? There's no penalty in crossing database boundaries on the same instance, e.g. INSERT ... FROM db.dbo.table; compared to INSERT ... FROM dbo.table;.

Finally, when you say "autogrowth" I cringe. If you know your database is going to get bigger than it is now, why not manually expand it to the size you need it to be? Autogrow is a performance nightmare when it happens during peak activity - and you will certainly notice it happening even on a single machine and even as the only one performing any work - basically all of your data transfer will halt when either the data or log file needs to grow. This is particularly true if you've left the autogrow settings at the current, ridiculous, defaults, and/or if you don't have instant file initialization. Don't forget to set your log growth settings appropriately too, as log files are not able to take advantage of IFI.