When importing raw files for internal conversion, should I use a secondary database or just isolate them within the database?
I've got between 1.25 and 2GB of CSV files to be imported (and already have most of the process running smooth as butter) so my question is: does it make sense as a "best practice" to use a "secondary" database for the import or just load them into the database that I'm going to be working in?
Example:
FinalDatabase -- this is the one I'm wanting to do, unless there's a net negative
core.Users
core.Addresses
core.Emails
core.OtherTables
staging.UsersImport
staging.ContactImport
OR
FinalDatabase
core.Users
core.Addresses
core.Emails
core.OtherTables
StagingDB
dbo.UsersImport
dbo.ContactImport
Obviously I'll be migrating from one table to the others via scripts, so it doesn't make much difference what the four-part name is going to be, whether it's just
INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM staging.Users
or
INSERT INTO core.Users (fieldlist) SELECT fieldlist FROM StagingDB.dbo.Users
The cons I've identified with the first style are the following:
- Increases backup size
- Increases file size (including increasing the file size of the existing database — autogrowth issues)
- Negates the potential use of multiple spindles (in my case not an option 🙁 )
Some of the pros I've identified with the first style:
- The original import data stays with the database in case of future questions (until someone truncates it, of course)
- It gets backed up with the latter data in case you need to recover something
What would be considered a best practice in this situation and why?
Best Answer
Well, since you have the CSV files, they can serve as a backup to the staging tables, so I see no benefit to keeping them in the primary database (or maintaining the staging tables after the import is done). If those staging tables can be discarded / truncated once the import is finished, then they really should be separate and have as minimal impact on the primary database as possible.
Also, I see no value in putting the staging tables in the primary database in any case, no matter how permanent the loaded data will be. And if you need the data to be permanent, you can always reference the data even if it is in a different database - in fact that is arguably better, because now:
you can handle that data alone independently:
In addition to the cons you've already identified, well, they're essentially just scratch tables. Even if you're only on a single spindle, why bother disrupting the primary database? There's no penalty in crossing database boundaries on the same instance, e.g.
INSERT ... FROM db.dbo.table;
compared toINSERT ... FROM dbo.table;
.Finally, when you say "autogrowth" I cringe. If you know your database is going to get bigger than it is now, why not manually expand it to the size you need it to be? Autogrow is a performance nightmare when it happens during peak activity - and you will certainly notice it happening even on a single machine and even as the only one performing any work - basically all of your data transfer will halt when either the data or log file needs to grow. This is particularly true if you've left the autogrow settings at the current, ridiculous, defaults, and/or if you don't have instant file initialization. Don't forget to set your log growth settings appropriately too, as log files are not able to take advantage of IFI.