I know it's an old Question but this might still help searchers and it's a problem that pops up every now and then.
The main reason you are hitting a performance ceiling without seeing any resource bottleneck is that you've reached the limit of what can be processed by a single session's single thread. The loop isn't processed in parallel; all inserts are done serially.
In my case, it takes 36 seconds to insert 3 million rows. That means 36 / 3,000,000 = 0.000012 seconds per row. That's pretty fast. On my system, it simply takes 0.000012 seconds to go through all the steps that are necessary for one insert.
The only way to get it done faster is to start up a second session in parallel.
If I start 2 sessions in parallel, each doing 1.5 million inserts, both finish in 18 seconds. I could scale out more, but my current test setup hits 95% CPU with two parallel sessions, so adding a third would skew the results because I would hit a CPU bottleneck.
If I start 2 parallel sessions each inserting 3 million rows, they both finish in 39 seconds. So that is now 6 million rows in 39 seconds.
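For context, the kind of single-session test loop being timed here can be sketched as follows (the table name, columns and row count are placeholders, not the original test code):

```sql
-- single-session, single-threaded loop: 3 million serial inserts
SET NOCOUNT ON;

DECLARE @i INT = 1;
WHILE @i <= 3000000
BEGIN
    INSERT INTO dbo.TestTable (ID, Payload)
    VALUES (@i, 'x');

    SET @i += 1;
END;
```

Running this same script simultaneously from two sessions (each covering half of the range) is what produces the parallel timings above.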
Okay, that still leaves us with the NETWORK_IO wait showing up.
The NETWORK_IO waits are an artifact of using Extended Events to trace them. In my case the insert takes 36 seconds (on average). When using the Extended Events approach (from the link above in the very first comment), this is what is registered:
Wait Type            Wait Count  Total Wait Time (ms)  Total Resource Wait Time (ms)  Total Signal Wait Time (ms)
NETWORK_IO                 3455                 68808                          68802                            6
PAGEIOLATCH_SH                3                    64                             64                            0
PAGEIOLATCH_UP               12                    58                             58                            0
WRITE_COMPLETION              8                    15                             15                            0
WRITELOG                      3                     9                              9                            0
PAGELATCH_UP                  2                     4                              4                            0
SOS_SCHEDULER_YIELD       32277                     1                              0                            1
IO_COMPLETION                 8                     0                              0                            0
LATCH_SH                      3                     0                              0                            0
LOGBUFFER                     1                     0                              0                            0
You can see that 68 seconds of NETWORK_IO is registered. But since the insert loop is a single-threaded action that took 36 seconds, this can't be. (Yes, multiple threads are used, but the operations are serial, never in parallel, so you can't accumulate more wait time than the total duration of the query.)
If I don't use extended events but just the wait stats DMVs on a quiet instance (with just me running the insert) I get this:
Wait Type                        Wait Count  Total Wait Time (ms)  Total Resource Wait Time (ms)  Total Signal Wait Time (ms)
SOS_SCHEDULER_YIELD                    8873                  0.21                           0.01                        0.20
PAGEIOLATCH_UP                            3                  0.02                           0.02                        0.00
PREEMPTIVE_OS_AUTHENTICATIONOPS          17                  0.02                           0.02                        0.00
PAGEIOLATCH_SH                            1                  0.00                           0.00                        0.00
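The numbers above come from the standard wait-stats DMV; a query of roughly this shape produces them (on a quiet instance the instance-wide numbers reflect just your session; on SQL Server 2016+ you can use sys.dm_exec_session_wait_stats for true per-session numbers):

```sql
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```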
So the NETWORK_IO you were seeing in the Extended Events log wasn't related to your insert loop. (If you didn't turn NOCOUNT ON, you would have massive async network IO waits; +1 Martin.)
However, I don't know why the NETWORK_IO shows up in the Extended Events trace. Sure, writing the events out to an asynchronous file target accumulates ASYNC_NETWORK_IO, but surely that is all done on a different SPID than the one we are filtering on. (I might ask this as a new question myself.)
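For reference, a wait-tracking Extended Events session of the kind used above looks roughly like this (the session name, file path and SPID are placeholders, not the exact session from the linked comment):

```sql
CREATE EVENT SESSION [TrackWaits] ON SERVER
ADD EVENT sqlos.wait_info (
    WHERE (sqlserver.session_id = 53)  -- the SPID running the insert loop
)
ADD TARGET package0.event_file (SET filename = N'C:\Temp\TrackWaits.xel');

ALTER EVENT SESSION [TrackWaits] ON SERVER STATE = START;
```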
Your question can basically be rephrased as 'How does the query memory grant work?'. A good read on the subject is Understanding SQL Server memory grant.

Before a query is launched into execution, it may require a memory grant for sorts, hashes and other memory-hungry operations. This memory grant is an estimate. Based on the current system state (number of requests running and pending, memory available, etc.), the system grants the query a memory grant up to the required amount. Once the memory is granted, the query starts execution (it may have to wait in the dreaded 'resource semaphore' queue before it gets the grant).

At execution time, its memory grant is guaranteed by the system. This amount of memory can be shared with data pages (since they can always be flushed to disk) but never with other memory usage (i.e. it cannot be subject to 'steal'). So when the query starts asking for committed memory from its grant, the engine will deploy what you call 'strategy #1': data pages may be evicted (flushed if dirty) in order to give the query the memory it was promised.

Now, if the estimate was correct and the grant was 100% of the requested memory, the query should not 'spill'. But if the estimate was incorrect (it boils down to cardinality estimates, and is therefore subject to stale statistics), or if the query did not get the entire grant it asked for, the query will 'spill'. This is when tempdb comes into the picture and performance usually tanks.
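To see whether running queries are actually waiting for, or under-receiving, their grants, the standard memory grant DMV can be inspected (this query is my addition, not part of the original discussion):

```sql
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,   -- NULL while still queued on the resource semaphore
       ideal_memory_kb,     -- what the query would like with no memory pressure
       used_memory_kb,
       wait_time_ms
FROM sys.dm_exec_query_memory_grants;
```

A granted_memory_kb well below ideal_memory_kb is the reduced-grant situation described above.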
The only knob you have at your disposal that controls something in this process is the Resource Governor. Since the RG can be used to specify a MIN setting for a pool, it can be used to reserve memory for a certain workload so that it actually gets the memory grant it requests. Of course, this should only be done after proper investigation shows that reduced memory grants are the culprit, and after the impact on other workloads has been evaluated. And tested, of course.
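A minimal sketch of that Resource Governor setup (pool/group names and percentages are illustrative assumptions):

```sql
-- reserve a minimum share of query-grant memory for a reporting workload
CREATE RESOURCE POOL ReportPool
    WITH (MIN_MEMORY_PERCENT = 30, MAX_MEMORY_PERCENT = 100);

CREATE WORKLOAD GROUP ReportGroup
    USING ReportPool;

ALTER RESOURCE GOVERNOR RECONFIGURE;
```

A classifier function is still needed to route the relevant sessions into ReportGroup; without one, everything stays in the default group and the reservation has no effect.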
Now let's go back to your original question. If your investigation is correct (a very big if), I would like to point out two problems:
- You run, in production, queries that require memory grants to serve a web site. This is a big no-no. Memory grants are indicative of analytical queries, which have no place in serving HTTP requests.
- Your queries are probably not even getting the memory grant they request. Again, even more of a no-no for a latency-critical workload such as a web site.
So what that tells me is that you have a fundamental design and architecture problem. Web sites are latency driven and should create an OLTP-like workload, with no memory grants and no memory pressure on queries, not to mention no spills. Analytical queries should be run in offline jobs that store the pre-processed results for quick availability when HTTP requests need them.
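One common pattern consistent with this advice (table and column names here are purely illustrative): a scheduled job pre-computes the analytical result into a narrow summary table, and the web site reads only that table.

```sql
-- offline job, e.g. run every few minutes by SQL Server Agent:
-- the expensive aggregation (and its memory grant) happens here,
-- never on the HTTP request path
TRUNCATE TABLE dbo.SalesSummary;

INSERT INTO dbo.SalesSummary (ProductID, TotalQty, TotalAmount)
SELECT ProductID, SUM(Qty), SUM(Amount)
FROM   dbo.Sales
GROUP BY ProductID;
```

The request path then runs only a trivial, grant-free seek against dbo.SalesSummary.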
Given the code in your answer, you would most likely improve performance by making the following two changes:
- Start the query batch with BEGIN TRAN and end the batch with COMMIT TRAN.
- Decrease the number of updates per batch to less than 5000 to avoid lock escalation (which generally occurs at 5000 locks). Try 4500.

Doing those two things should decrease the massive amount of transaction log writes and lock/unlock operations that you are currently generating by doing individual DML statements.
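As an example of those two changes combined (the table and column names, and the GUID literals, are placeholders, since the Question's actual code is not reproduced here):

```sql
BEGIN TRAN;

-- the individual single-row UPDATEs generated by the app layer,
-- batched 4500 at a time to stay under the ~5000-lock escalation threshold
UPDATE dbo.TableName SET UniqueValue = '439A30A1-1B90-4705-9AA7-2AB6CD6A2058' WHERE ID = 1;
UPDATE dbo.TableName SET UniqueValue = 'BD67FE51-0B64-4F44-8B78-7D93AAB0C49C' WHERE ID = 2;
-- ... 4498 more single-row UPDATEs ...

COMMIT TRAN;
```

Wrapping the batch in one explicit transaction means one log flush per 4500 rows instead of one per row.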
The Question is a bit sparse on the details. The example code is only shown in an answer.
One area of confusion is that the description mentions updating two columns, yet the example code only shows a single column being updated. My answer above was based on the code, hence it only shows a single column. If there really are two columns to update, then both columns should be updated in the same UPDATE statement (a single SET clause can assign both columns).

Another issue that is unclear is where the "unique" data is coming from. The Question mentions that the unique values are GUIDs. Are these being generated in the app layer? Are they coming from another data source that the app layer knows about and the database does not? This is important because, depending on the answers to these questions, it might make sense to ask: (#1) can the unique values be generated by SQL Server itself (e.g. via NEWID()) rather than supplied by .NET?
If "yes" to #1 but the SQL, for whatever reason, needs to be generated in .NET, then you can use NEWID() and generate UPDATE statements that work on ranges of rows, in which case you do not need the BEGIN TRAN / COMMIT TRAN since each statement can handle all 4500 rows in one shot.

If "yes" to #1 and there is no real reason for these UPDATEs to be generated in .NET, then you can just do the following:
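A sketch of that pure T-SQL approach (table and column names are assumptions; the batch size matches the lock-escalation advice above):

```sql
-- loop in batches of 4500, generating the GUIDs with NEWID()
DECLARE @BatchSize    INT = 4500,
        @RowsAffected INT = 1,
        @StartingID   INT = 1;

WHILE (@RowsAffected > 0)
BEGIN
    UPDATE tab
    SET    tab.UniqueValue = NEWID()
    FROM   dbo.TableName tab
    WHERE  tab.ID BETWEEN @StartingID AND (@StartingID + @BatchSize - 1);

    SET @RowsAffected = @@ROWCOUNT;
    SET @StartingID  += @BatchSize;
END;
```

Each UPDATE is its own auto-commit transaction of at most 4500 rows, so no explicit BEGIN TRAN / COMMIT TRAN is needed.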
The code above only works if the ID values are not sparse, or at least if the values do not have gaps larger than @BatchSize, such that at least 1 row is updated in each iteration. This code also assumes that the ID field is the Clustered Index. These assumptions seem reasonable given the provided example code.

However, if the ID values do have large gaps, or if the ID field is not the Clustered Index, then you can instead test for rows that do not already have a value.

BUT, if "no" to #1 and the values are coming from .NET for a good reason, such as the unique values per each ID already existing in another source, then you can still speed this up (beyond my initial suggestion) by supplying a derived table.

I believe the limit on the number of rows that can be joined via VALUES is 1000, so I grouped two sets together in an explicit transaction. You could test with up to 4 sets of these UPDATEs to do 4000 per transaction and keep below the limit of 5000 locks.
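A sketch of that derived-table approach (table and column names, and the GUID literals, are placeholders):

```sql
BEGIN TRAN;

-- first set: up to 1000 (ID, GUID) pairs supplied by .NET as a VALUES list
UPDATE tab
SET    tab.UniqueValue = src.UniqueValue
FROM   dbo.TableName tab
INNER JOIN (VALUES
        (1, '439A30A1-1B90-4705-9AA7-2AB6CD6A2058'),
        (2, 'BD67FE51-0B64-4F44-8B78-7D93AAB0C49C')
        -- ... up to 1000 pairs ...
    ) src (ID, UniqueValue)
    ON src.ID = tab.ID;

-- second set: the next batch of up to 1000 pairs, same shape,
-- still inside the same transaction

COMMIT TRAN;
```

Each set-based UPDATE replaces up to 1000 single-row statements, and grouping the sets in one transaction keeps the log-flush count down while staying under 5000 locks.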