Sql-server – SQL Server statements intermittently slow on SQL Server 2008 R2

performancesql-server-2008

On one of our customers, we've been having some performance issues on our application. It's a .NET 3.5 web app that is consuming and updating data on a SQL Server database. Currently our production environment consists of a Windows 2008 R2 machine as the front end, and a SQL Server 2008 R2 cluster on the back end. Our app uses COM+ and MSDTC to connect to the database.

Here's what's happening: our end users sometimes complain of slowness in the application. Some pages take some more time to load than would be expected. While trying to figure out what's happening, I managed to find out some strange behaviour on the database side that may be the cause for the performance degradation. I noticed that sometimes there are some SQL statements that take a lot more time to run that what would be expected. I managed to identify some of these statements (mainly it's invocations of some of our application's stored procedures) using a profiler trace (with TSQL_Duration template) to identify the long running queries.

The problem is that when I run these stored procedures directly on the database on SQL Management Studio sometimes they do take long (about 7/8 seconds), other times they are fast (under 1 sec.). I don't know why this happens and it's driving me nuts, because the SQL machine (4 core, 32 GB) is not being used by any other applications, and these queries should not take this long to run.

Not being a DBA or a SQL Server guru, I've been trying to look at some stuff that may help me understand the problem. Here's the steps I've taken to try and sort out the problem and what I found out so far:

  • All of the TSQL code called by the application is written in stored procedures.
  • I identified some of the long running queries on SQL Server profiler, however when I run these on Management Studio they either take long to run (from 4 to 10 secs.), or run quickly (under 1 sec.). I am running the exact same queries with the same data passed in the parameters. These queries are mainly stored procedures with select statements in them.
  • I tried looking at the waits and queues statistics to try and figure out if there are processes waiting on some resources. I ran the following query:
WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT',  'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT',     'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
        'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
        'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK')
    )
SELECT
    W1.wait_type AS WaitType, 
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S,
    CAST ((W1.ResourceS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgRes_S,
    CAST ((W1.SignalS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgSig_S
FROM Waits AS W1
    INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount,    W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

Here's what I found out:

  • After I reset the stats using DBCC SQLPERF (about 1 or 2 hours after), the wait types I have most is SOS_SCHEDULER_YIELD and WRITELOG
  • Over time (after about 1 day execution), the wait types that happen the most on the database are CXPACKET (67%) and OLEDB (17%), even though the average wait time for each is not long. I also noticed that the longer running statements identified on SQL Profiler are calls to stored procedures that return more than one resultset (often 3). Can there be a paralellism problem here? Is there any way I can try to identify if this is the cause of the problem?
  • I've read somewhere that OLEDB waits can be caused by calls to OLEDB resources like linked servers. We do have a linked server to connect with a Indexing Services machine (MSIDXS), however none of the statements identified as long running make use of that linked server.
  • The higher average wait time I have is for LCK_M_X type waits (about 1.5 secs average), but these wait types don't happen very often compared to other types (for example, 64 LCK_M_X waits vs 10,823 CXPACKET waits on the same period of time).
  • One thing I noticed is that the MSDTC service is not clustered. SQL Server service is clustered but not MSDTC. Can there be a performance hit because of this? We are using MSDTC because our app uses Enterprise Services (DCOM) to access the database, but the servers were not installed and configured by us, but by our client.

Can anyone help me make some more sense of this data? Can anyone give me a hand on understanding what may be happening? Is there something I can do on the server to try and figure things out? Should I talk to the application development team?

Best Answer

Thank you for the detailed explanation of your problem (one of the best laid out questions actually).

WRITELOG is a very common type of wait, so don't worry about it. Looking at the SOS_SCHEDULER_YIELD indicate CPU pressure and also the CXPACKET, it is possible that there must be some missing indexes and you may be retrieving lot of data from the queries for an OLTP system. I suggest you to look at the Missing Indexes DMV and see if there are any indexes (almost sure there will be more than few) that are in the questionable procs.

http://sqlfool.com/2009/04/a-look-at-missing-indexes/

http://troubleshootingsql.com/2009/12/30/how-to-find-out-the-missing-indexes-on-a-sql-server-2008-or-2005-instance-along-with-the-create-index-commands/

Look for Jonathan Kehayias's post on sqlblog.com on this too.

Also, take a look at Parameter sniffing.

http://sommarskog.se/query-plan-mysteries.html

http://pratchev.blogspot.com/2007/08/parameter-sniffing.html

It's NOT a compete answer for your needs but a good starting point. Let us know if you need more details.