SQL Server – Identifying the Biggest Bottleneck in Writing Data

optimization, parallelism, performance, query-performance, sql-server

I'm running an ETL process that is writing about 2 million rows to a SQL Server database.

I'm trying to optimize the time it takes for pure inserts (I guess updates are a different story).

I'm wondering what the biggest bottleneck is, or the best way to reduce the time, for basic inserts into a SQL Server database.

I mean, the first thing, probably, is the size of the data, right? The number of rows, the number of columns, and the data size in each column all matter. Some of these can't be reduced, but the KB footprint of each row is one thing that can potentially be optimized, right?

What else can be optimized, or what is the largest factor? Is it the transmission medium? I mean, how big is the difference between writing to a database on the same computer versus writing across a web connection (one that is robust, fast, and has a ping of 1 ms)?

Finally — why is it that multiple parallel connections to the database seem to speed up the process to a point? I mean, when I have 20 connections making inserts round-robin style, it's about 6-7x faster than one connection writing ALL the data. I'm curious why this is.

Right now I have 2.2 million rows totaling 2.7 GB. That's about 1.23 KB per row.

Right now, inserting 1000 rows at a time (1.23 MB) using 14 connections takes 6.7 seconds. That's a snail-paced 1000 ÷ 6.7 ÷ 14 ≈ 10.66 rows per second per connection. Even assuming one connection would be just as fast (it isn't), that's about 150 rows per second at best, which is not exactly "fast" either. I'm writing over a super-fast, robust web connection because we can't have the ETL process on the same machine as the data warehouse.

So... how can I optimize for speed here?

The reason for 1000 rows at a time is that the data comes in pages of 1000 – but optimizing the parsing is a separate issue for now.

I believe I have one primary key index, but nothing too write-expensive. Right now I've simply been doing Monte Carlo-style testing (try it and see what works), but I need something more focused.

Best Answer

You need to go and read:

You have to read each link. Really. The short story is that efficient loading must use bulk insert and achieve minimal logging. SSIS is by far the better tool for that, but you can also achieve it programmatically. OleDB is best, but SqlClient (C#) will also do if needed. Things like row size and network speed are unlikely to be your concern here, but you should always measure (and the articles linked will teach you how). Load performance should be completely orthogonal to existing DB size; if load speed decreases as the DB size increases, you must be doing something wrong.
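For the SqlClient route, a minimal sketch of a bulk load with SqlBulkCopy might look like this (the destination table name and the DataTable source are assumptions for illustration, not from the question):

    // Minimal sketch: bulk-load one 1000-row page with SqlBulkCopy.
    // dbo.TargetTable is a hypothetical destination table.
    using System.Data;
    using System.Data.SqlClient;

    public static class Loader
    {
        public static void BulkLoadPage(string connectionString, DataTable page)
        {
            // TableLock requests the bulk-update (BU) lock that a
            // minimally logged load into a heap requires.
            using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
            {
                bulk.DestinationTableName = "dbo.TargetTable";
                bulk.BatchSize = 1000;    // one page per batch
                bulk.BulkCopyTimeout = 0; // 0 = no timeout
                bulk.WriteToServer(page); // streams rows via the bulk-load API
            }
        }
    }

Whether this actually achieves minimal logging also depends on the recovery model and the indexes on the target; the articles explain the prerequisites, so measure rather than assume.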

Finally — why is it that multiple parallel connections to the database seem to speed up the process to a point? I mean, when I have 20 connections making inserts round-robin style, it's about 6-7x faster than one connection writing ALL the data. I'm curious why this is.

I'll be speculating, due to the lack of any investigation and measurement. But most likely you are now blocked by the commit flush rate: every time you COMMIT, SQL Server has to stop and wait for the log to be written to disk. If you do not explicitly begin transactions, then every statement must stop and wait. Adding more loaders gives better log utilization, because the log gets committed for 20 writers at once. Read What is an LSN: Log Sequence Number.
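To illustrate the commit-flush point, here is a sketch of batching one 1000-row page into a single explicit transaction, so the loader waits for one log flush per page instead of one per statement (table and column names are hypothetical):

    // Sketch: one explicit transaction per 1000-row page, so COMMIT
    // (and its log-flush wait) happens once per page, not per row.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    public static class PageWriter
    {
        public static void InsertPage(string connectionString,
                                      IEnumerable<(int Id, string Payload)> rows)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var tx = conn.BeginTransaction())
                using (var cmd = new SqlCommand(
                    "INSERT INTO dbo.TargetTable (Id, Payload) VALUES (@id, @payload)",
                    conn, tx))
                {
                    var id = cmd.Parameters.Add("@id", SqlDbType.Int);
                    var payload = cmd.Parameters.Add("@payload", SqlDbType.NVarChar, -1);
                    foreach (var row in rows)
                    {
                        id.Value = row.Id;
                        payload.Value = row.Payload;
                        cmd.ExecuteNonQuery();
                    }
                    tx.Commit(); // single log-flush wait for the whole page
                }
            }
        }
    }

Twenty parallel writers get a similar effect by accident, because their commits share log flushes (group commit); a single writer with explicit transactions gets it deliberately.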