SQL Server – VLDB with Large Quantity of Data File Imports

importsql serverssis

I am a little out of my league with this scenario and hope this community can help get me on the right direction.

I have recently been tasked with re-designing and implementing a database application where the company I work for processes a large quantity of text files (sizes ranging from a 100MB to 3GB). We currently process approximately 600 of these files per year. This data needs to be accessible after our initial processing.

The current implementation has each file being imported via Application bulk load or SSIS to a "temp" database as individual tables. I would like to keep this process in place as it gives a good entry point for when manual entry into processing is needed.

I am in the process of trying to create a new MS SQL Server installation and architecture, and am looking for advice/articles which would cover a similar scenario as to best practices (partitioning, filegroups, etc.). I have worked in databases for a few years, but not at this level.

So if you have any experience in this type of scenario or know of a good article or book you could point me to, I would be extremely grateful. I just need a nudge in the right direction. Thank you in advance.

Best Answer

Below are general recommendations and your milage may vary depending on you workload running and your IO subsystem, server hardware configuration, etc.

At SSIS level, you can look into below things to speed up data read and data load :

  • Fast Parse Option along with its limitations.
  • Use the SQL Server Native Client 10.x OLE DB provider for an In-Memory, high performance connection
  • Set the Packet Size to 32767
  • Select the OLE DB Destination Data Access mode “Table or View – fast load” option

Refer to Speeding Up SSIS Bulk Inserts into SQL Server for more details.

Now coming to SQL Server - optimizing data loading :

  • Use Trace Flag 610 : inserts into a table with indexes will generally be a minimally logged operation. Tables with clustered indexes support multiple bulk load stream inserting concurrently, as long as these streams contain non-overlapping data. If data overlap is detected, streams will block (but not deadlock).
  • Look for table partitioning (Partitioned Tables and Indexes in SQL Server 2005) especially sliding window technique.
  • Look for -E startup parameter as well.

Look for optimizing network configuration as well :

  • Use fast NIC and switches.
  • Have the latest certified drivers for your NIC.
  • Enable full duplex.
  • Enable support for Jumbo frames. Use TCP

  • Chimney Offload. Use Receive Side Scaling (RSS).

Also, you should monitor Wait Statistics on the server especially SOS_SCHEDULER_YIELD resulting in scheduler contention on Servers having multiple CPUs running concurrent Bulk load operations and competing for the same CPU Cycles.

Refer to this excellent whitepaper The Data Loading Performance Guide which have everything that I mentioned with diagrams and examples.

Also, to automate the sliding window technique - creating staging tables, loading data into it and then switching the partitions, you can fully automate it using SQL Server Partition Management -- it has command line option as well -- available at CodePlex.