Sql-server – Slower SQL Server performance after allocating more CPU and RAM

performancesql serversql-server-2008-r2windows-server

We have SQL Server 2008 R2 (10.50.1600) running on a virtual Windows 2008 R2 server. After upgrading the CPU from 1 core to 4 and the RAM from 4 gb to 10 gb, we've noticed the performance is worse.

Some observations I see:

  1. A query that took <5 seconds to run is now taking >200 seconds.
  2. The CPU is pegged at 100 with sqlservr.exe as the culprit.
  3. A select count(*) on a table with 4.6 million rows took over 90 seconds.
  4. The processes that are running on the server haven't changed. The only change was to increase the cpu and ram.
  5. Other sql servers have a static paging file where this server is set to manage it on its own.

Has anyone run into this issue before?

Per sp_BlitzErik, I ran

EXEC dbo.sp_BlitzFirst @SinceStartup = 1;

Giving me these results.

wait stats

Best Answer

There's a lot going on here, and most of it is pretty broad and vague.

  1. 2008R2 RTM came out on April 21, 2010. It's totally out of support. You'll want to prioritize getting on the latest Service Pack, which came out just about 3 years ago to the day. That way you'll be covered if you're hitting a weird bug or something. Head on over here to figure out what you need to download.

  2. Since you added vCPUs (from 1 to 4) and didn't change any settings, your queries can now go parallel. I know this sounds like they'll all be faster, but hang on!

  3. You may have added RAM, but you may not have changed Max Server Memory so your server can take advantage of it.

  4. Figure out what your server is waiting on. An open source project I work on provides free scripts to help you measure your SQL Server. Head on over here if you wanna give them a try.

You'll wanna grab sp_BlitzFirst to check out your server's wait stats. You can run it a couple ways.

This will show you what your server has been waiting on since it started up.

EXEC dbo.sp_BlitzFirst @SinceStartup = 1;

This will show you what queries are waiting on now, during a 30 second window.

EXEC dbo.sp_BlitzFirst @Seconds = 30, @ExpertMode = 1;

Once you figure out what queries are waiting on (there's a ton of stuff written about wait stats out there), you can start making changes to get things under control.

If you see them waiting on CXPACKET, that means your queries are going parallel, and maybe trampling over each other. If you hit this, you'll probably want to consider bumping Cost Threshold for Parallelism up to 50, and maybe dropping MAXDOP down to 2.

After this step is when you want to use something like sp_WhoIsActive or sp_BlitzWho (the latter is in the GitHub repo from earlier) to start capturing query plans. Aside from wait stats, they're one of the most important things you can look at to figure out what's going wrong.

You may also want to check out this article by Jonathan Kehayias about VMWare Counters to check out in relation to SQL Server.

Update

Reviewing the wait stats and boy are they weird. There's definitely something up with the CPUs. Your server is mostly sitting around bored, but when things heat up, things get bad. I'll try to break this down easily.

  1. You're hitting a poison wait called THREADPOOL. You don't have a ton of it, but that makes sense because your server isn't terribly active. I'll explain why in a minute.

  2. You have really long average waits on SOS_SCHEDULER_YIELD and CXPACKET. You're on a VM, so you're gonna wanna make sure that the SQL Server has reservations, or that the box isn't horribly oversubscribed. A noisy neighbor can really ruin your day here. You're also going to want to make sure that the server/VM guest/VM host aren't running in Balanced Power mode. This makes your CPUs spin down to unnecessarily low speeds, and they don't immediately spin back up to full speed.

  3. How do they tie in? With 4 CPUs you have 512 worker threads. Keep in mind, you had the same amount with a single CPU, but now that your queries can go parallel, they can consume many more worker threads. In your case 4 threads per parallel branch of a parallel query.

What's going parallel? Most likely everything. The default Cost Threshold for Parallelism is 5. That number was made the default sometime in the late 90s working on a desktop that looked like this.

NUTS

Granted, your hardware is smaller than most laptops, but you're still a bit ahead of that thing.

When lots of parallel queries get going, you're running out of those worker threads. When that happens, queries just sit around waiting for threads to get going. That's also where SOS_SCHEDULER_YIELD comes in. Queries are stepping off CPUs and not getting back on for a long time. I don't see any blocking waits, so you're most likely just all stuffed up on intra-query parallelism waits.

What can you do?

  1. Make sure nothing is in Balanced Power mode
  2. Change MAXDOP to 2
  3. Change cost threshold for parallelism to 50
  4. Follow the Jon K. article above to validate VM health
  5. Use the script called sp_BlitzIndex to look for any missing index requests.

For more thorough troubleshooting, check out the whitepaper I wrote for Google on hardware sizing in the cloud.

Hope this helps!