Sql-server – Why are periodic restarts required to keep the instance performing well

configurationperformancesql serversql-server-2005

We've got a production DB server on SQL 2005. Everything runs normally for a while, but after a couple weeks we see a notable performance drop. Only restarting SQL Server brings performance back to normal.

Some background:

  • Running over 1200 databases (mostly single tenant, some multi-tenant). Before anyone lectures on moving to only multi-tenant, there are valid reasons for keeping this structure……
  • RAM is 16 GB. After restarting, it doesn't take too long for SQL Server to go back to 15 GB usage.
  • Active DB connections is about 80 connections – which we feel is fairly healthy considering there is one connection pool per web server per process – so we don't have a connection leak issue.

We've tried several things in non-peak times:
– Run DBCC DROPCLEANBUFFERS (with a CHECKPOINT) to clear data cache. It has no effect, nor does it clear any of the RAM usage).
– Run FREEPROCCACHE and FREESYSTEMCACHE to clear query plans and stored proc cache. No effect.

Obviously restarting SQL Server is not ideal in an active production environment. We're missing something. Anyone else go through this?

UPDATE: April-28-2012
Still battling this problem. I've lowered the memory for SQL Server to 10 GB, just to rule out any contention with the OS. I'm getting closer to narrowing it down, but need some help from my next step.

Here's what I found, after restarting SQL Server, the page file hovers between 12.3 GB and 12.5 GB. It'll stay that way for days. Total server threads will hang out between 850 and 930 – also stable and consistent for days on end (sqlserver is steadily between 55 and 85 of those depending on traffic).

Then, there's "an event". I have no idea what the event is, I can't see it in the logs, and I can't see anything consistent on the day of week or time it happens, but all of the suddent he pagefile jumps to either 14.1 or 14.2 GB, and the threads jump to between 1750 and 1785.

Checking perfom when this happens, over 900 of those threads are sqlserver. So I go to sp_who2 to see where are these threads are coming from…and there's just the used 80 or so db connections.

So….does anyone have any ideas how I can locate where the rest of these 900 threads on SQL server are, and what they are doing?

UPDATE: June-01-2012
Still battling the issue. For anyone reading this still, the issue with the threads jumping up has been resolved. This was caused by autodated ComVault backup software. It was creating a thread trying to backup databases that were no longer there (it was maintaining a list of previous databases) rather than just backing up the current databases.

But – the issue still remains, and we have to restart every week, give or take a few days. Working with the Rackspace team to see if they can shed any light.

Best Answer

You say that everything is fine, then after a couple of weeks, performance drops. (Usually, people claim that performance drops quickly, or at specific times, or at seemingly random intervals. That could mean bad I/O performance or lock storms or cpu-intensive queries running at wierd times, or a heavyweight scheduled job or lack of indexing or bad stats causing cpu-intensive queries or disk reads. Or other stuff.) Weeks is unusual.

My hypothesis is that another application on your server is leaking memory. I have seen this with virus software (every DBA's favorite server software villain) and 3rd party monitoring software. I would double check the memory usage of SQL Server, over time, and I'd grab all of the memory usage of all of the other applications on the box as well. If you have hard limits set on SQL Server's memory usage and have it set to not allow paging, it might be other apps that are getting paged out and eating up I/O capacity.

It's not hard to look for. If you are not already keeping metrics on the server, I would just start up Perfmon and have it grab a sample every 30 or 60 minutes. After a few days, you may see another applications memory usage creep upwards.

Are there error messages in the SQL Server log stating that "significant portions of sql server have been paged out"? That would also be a big clue.