A solution I've used in the past (and have recommended here and on StackOverflow before) is to create two additional schemas:
CREATE SCHEMA shadow AUTHORIZATION dbo;
CREATE SCHEMA cache AUTHORIZATION dbo;
Now create a mimic of your table in the cache schema:
CREATE TABLE cache.IPLookup(...columns...);
Now when you are doing your switch operation:
TRUNCATE TABLE cache.IPLookup;
BULK INSERT cache.IPLookup FROM ...;
-- the nice thing about the above is that it doesn't really
-- matter if it takes one minute or ten - you're not messing
-- with a table that anyone is using, so you aren't going to
-- interfere with active users.
-- this is a metadata-only operation, so it's extremely fast - it will
-- wait for existing locks to be released, but won't block new requests
-- for very long at all:
BEGIN TRANSACTION;
ALTER SCHEMA shadow TRANSFER dbo.IPLookup;
ALTER SCHEMA dbo TRANSFER cache.IPLookup;
COMMIT TRANSACTION;
-- now let's move the shadow table back over to
-- the cache schema so it's ready for next load:
ALTER SCHEMA cache TRANSFER shadow.IPLookup;
TRUNCATE TABLE cache.IPLookup;
-- truncate is optional - I usually keep the data
-- around for debugging, but that's probably not
-- necessary in this case.
This will be more cumbersome if you have foreign keys and other dependencies (you may have to drop and re-create those), and of course it completely invalidates statistics etc., which in turn can affect plans. But if the most important thing is getting accurate data in front of your users with minimal interruption, this is an approach to consider.
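Since the swap invalidates statistics on the newly promoted table, one mitigation is to refresh them right after the transfer - a sketch, reusing the table name from the example above:

```sql
-- refresh statistics on the newly promoted table so the optimizer
-- isn't left with whatever was gathered during the cache load
UPDATE STATISTICS dbo.IPLookup WITH FULLSCAN;

-- optionally mark cached plans referencing the table for recompile
EXEC sp_recompile N'dbo.IPLookup';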
My first comment is that you are doing an ELT (Extract, Load, Transform) rather than an ETL (Extract, Transform, Load). While ELTs leverage set-based relational advantages and can be very fast, they are sometimes very write-intensive (hard on storage) - specifically, on the t-log - because the transform is done on disk (typically an update or insert). I prefer ETL when possible, as the transform is done in the buffer and, when done correctly, requires minimal t-log writes. Buffer (memory) is cheap; fast storage is not. For some bulk operations, the t-log is a non-value-adding bottleneck.
Here are a few things that you're doing that I wouldn't recommend:
- Bulk loading to tempdb. I'd recommend doing the bulk load into a real table in the destination database. Then you can size your files accordingly and not worry about impacting tempdb.
- Bundling independent procedures together. Split these two procedures up. The bulk load and the merge are independent of each other. Splitting them into individual procedures makes them more modular / unit testable.
It looks like you have the minimal logging rules covered pretty well: you're loading to an empty B-tree with no nonclustered indexes, using trace flag 610, with the ordering key specified, in bulk-logged recovery. Outside of the temp table, everything looks OK here. As long as the file is actually ordered by the key, you should be good. Are you popping the log on tempdb or the user database?
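For reference, a load that meets those prerequisites might look like the following - a sketch only; the table name, file path, terminators, and key column are placeholders, not from the original post:

```sql
-- sketch: TABLOCK plus an ORDER hint matching the clustered key, in
-- bulk-logged (or simple) recovery, allows a minimally logged load;
-- trace flag 610 extends minimal logging to non-empty B-trees
BULK INSERT dbo.StageTable
FROM 'C:\loads\data.csv'
WITH (
    TABLOCK,
    ORDER (SomeKey ASC),       -- must match the clustered index key order
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);
```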
On the merge statement:
UPDATEs will always be fully logged. Are you changing a pretty significant portion of your table? If so, you might consider doing the merge in memory (an SSIS data flow task or .NET) and then bulk loading into a new table. This is more work, but most of it is done in the buffer and minimal t-log is used. A minimally logged insert can be faster than a fully logged update if the portion of change is significant.
Since you're using trace flag 610, the insert can be minimally logged when using a TABLOCK hint. See here for more info on MERGE with TABLOCK: http://blogs.msdn.com/b/sqlserverstorageengine/archive/2010/06/03/minimal-logging-and-merge-statement.aspx Note that the update will still be fully logged if you go this route.
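A hedged sketch of that pattern - the table and column names here are placeholders, not from the original post:

```sql
MERGE dbo.TargetTable WITH (TABLOCK) AS tgt
USING dbo.StageTable AS src
    ON tgt.KeyCol = src.KeyCol
WHEN MATCHED THEN
    UPDATE SET tgt.ValueCol = src.ValueCol   -- still fully logged
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyCol, ValueCol)
    VALUES (src.KeyCol, src.ValueCol);       -- can be minimally logged
                                             -- with TF 610 + TABLOCK
```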
Why this is a bad idea.
Your dump comes from a Paradox database. Paradox supports a lot of types - about 18 of them. CSV is typeless; it's even worse than JSON. Everything is a string.
All you can do is infer the types with heuristics; that metadata has been lost, and inferring it is not going to be easy. Doing it right is possible, but you're still guessing, and it's a laborious task. Some languages give you tools to start you down this path, like Perl's looks_like_number, and some type systems provide methods like is_type that will help you get there faster.
Good ideas
If you still have the database, however, you can plug into it via ODBC, or use other code that can read Paradox files directly. As a side note, pxlib has a CLI tool called pxview which dumps to SQL. That's the best starting point.
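If you do end up loading the CSV as all-varchar and inferring types inside SQL Server, TRY_CONVERT (SQL Server 2012+) can drive the heuristic - a sketch, with the table and column names assumed for illustration:

```sql
-- count non-null values in a raw varchar column that fail conversion
-- to a candidate type; zero failures makes that type a candidate
SELECT
    COUNT(SomeCol) AS non_null_rows,
    SUM(CASE WHEN TRY_CONVERT(int,  SomeCol) IS NULL
              AND SomeCol IS NOT NULL THEN 1 ELSE 0 END) AS fails_int,
    SUM(CASE WHEN TRY_CONVERT(date, SomeCol) IS NULL
              AND SomeCol IS NOT NULL THEN 1 ELSE 0 END) AS fails_date
FROM dbo.RawImport;
```

Run the probe per column, widening the candidate list (int, decimal, date, etc.) as needed; you're still guessing, but at least the guess is checked against every value.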