SQL Server – What can cause slow queries with low resource usage?

performance, sql-server, sql-server-2008-r2

I am writing an application which builds a database on a SQL server. The application loads data into some staging tables, then executes a series of stored procedures which merge the staging tables into the main tables.

The problem I'm seeing is that the merge itself is consistently much slower than I expect given the resources available to the server (8 cores, DB stored on a RAID0 array of two SSDs, 32GB RAM, no other server processes running on the machine and no other users of the SQL Server instance).

To be specific, during the merge:

  1. Server CPU usage is very low (a few percent)
  2. There are essentially no reads or writes against the logical volume that the database is stored on (measured in perfmon)
  3. In a 30s period during the merge, Brent Ozar's wait stats triage script (http://www.brentozar.com/responder/triage-wait-stats-in-sql-server/) reports a maximum wait time of only 0.6s (on the SOS_SCHEDULER_YIELD wait type)
  4. Inspecting dm_io_virtual_file_stats for both tempdb and the main database before and after a 30s period during the merge shows that IO stall time increases by less than 0.1s, indicating that we are not blocked on IO (a snapshot sketch follows this list)
  5. Activity Monitor confirms the IO system is lightly loaded: maximum IO response latency is only 9ms, with a very low number of bytes transferred
  6. Intel Performance Counter Monitor (http://www.intel.com/software/pcm) shows that we are not swamping RAM bandwidth because only about 500MB/s is being transferred (out of a benchmarked maximum on this machine of around 13GB/s)
  7. Inspecting dm_exec_sessions for the merging session before and after a 30s period during the merge shows that:
    • total_elapsed_time increased by ~10s (I assume this is not 30s because dm_exec_sessions only gets updated periodically)
    • cpu_time increases by only 0.1s
    • memory_usage is only 2 pages both before and after
    • 20,000 logical_reads are performed in this 10s period (i.e. 2,000 a second). This seems very low to me given that it represents only 16MB/s of data transfer (2,000 8KB pages per second).
    • 1196 writes and 36 reads are performed. (I'm not quite sure how to interpret this.)
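
For reference, points 4 and 7 both rely on a before/after snapshot over a 30s window. A minimal sketch of that technique follows (the @spid value, the MyDatabase name and the temp table names are illustrative):

-- Snapshot the IO and session counters, wait 30s, then report the deltas.
DECLARE @spid INT = 55;  -- session_id of the merging session (illustrative)

SELECT database_id, file_id, io_stall, num_of_reads, num_of_writes
INTO #io_before
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
WHERE database_id IN (DB_ID('tempdb'), DB_ID('MyDatabase'));  -- placeholder name

SELECT session_id, total_elapsed_time, cpu_time, memory_usage,
       logical_reads, reads, writes
INTO #session_before
FROM sys.dm_exec_sessions
WHERE session_id = @spid;

WAITFOR DELAY '00:00:30';

-- IO stall delta per file: this should grow substantially if we were blocked on IO.
SELECT a.database_id, a.file_id,
       a.io_stall      - b.io_stall      AS io_stall_delta_ms,
       a.num_of_reads  - b.num_of_reads  AS reads_delta,
       a.num_of_writes - b.num_of_writes AS writes_delta
FROM sys.dm_io_virtual_file_stats(NULL, NULL) a
JOIN #io_before b
  ON a.database_id = b.database_id AND a.file_id = b.file_id;

-- Session deltas: comparing elapsed time to CPU time shows how little CPU is used.
SELECT a.total_elapsed_time - b.total_elapsed_time AS elapsed_delta_ms,
       a.cpu_time           - b.cpu_time           AS cpu_delta_ms,
       a.logical_reads      - b.logical_reads      AS logical_reads_delta,
       a.writes             - b.writes             AS writes_delta,
       a.reads              - b.reads              AS reads_delta
FROM sys.dm_exec_sessions a
JOIN #session_before b ON a.session_id = b.session_id;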

SQL Profiler shows that it is the execution of statements of the following form that is slow (taking as much as 5s when there are ~1000 rows in the staging table stage.LastTradePrice):

;WITH Source AS 
(
    SELECT 
        RowNumber = 
            ROW_NUMBER() OVER (PARTITION BY GSAQuoteID, Date ORDER BY ID DESC)
        , Source.* 
    FROM 
        stage.[LastTradePrice] Source
)

INSERT INTO @Stage 
    (
    GSAQuoteID
    , Date
    , Value
    )
SELECT 
    Source.GSAQuoteID
    , Source.Date
    , Source.Value
FROM 
    Source
OUTER APPLY 
    (
    SELECT TOP 1 
        Value = 
            CASE WHEN ToSourceID IS NULL THEN Target.Value ELSE NULL END
    FROM 
        dbo.[LastTradePrice] Target
    WHERE 
        Target.GSAQuoteID = Source.GSAQuoteID 
    AND Target.Date = Source.Date
    ORDER BY 
        Target.FromSourceID DESC
    ) Target
WHERE 
    Source.RowNumber = 1
AND (
        (Source.Value IS NULL AND Target.Value IS NOT NULL) 
        OR 
        (Source.Value IS NOT NULL AND Target.Value IS NULL) 
        OR 
        (Source.Value <> Target.Value)
    );

This query is the first stage of the merge: it selects those elements of the staging table stage.LastTradePrice which are different from the data we already have in the main table dbo.LastTradePrice.

The query plan for this statement looks pretty much optimal (i.e. it sorts the staging table by (GSAQuoteID, Date, ID DESC), segments it by (GSAQuoteID, Date) and then does a single-row seek into the LastTradePrice table for the top row in each segment, before doing the final clustered index insert). More details:

  • stage.LastTradePrice has a clustered index on ID, a synthetic integer primary key. Rows occur in the order they are bulk inserted by the application
  • dbo.LastTradePrice has a clustered index on (GSAQuoteID, Date, FromSourceID)
  • There are about 56M rows in dbo.LastTradePrice, an average of 168 per GSAQuoteID and just over 1 per (GSAQuoteID, Date) pair.
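
For concreteness, the layout above corresponds to DDL roughly like the following (a sketch: only the clustered key columns come from the description above; the column types and constraint names are assumptions):

-- Staging table: clustered on the synthetic identity, so rows sit in bulk-load order.
CREATE TABLE stage.LastTradePrice
(
    ID         INT IDENTITY(1,1) NOT NULL,
    GSAQuoteID INT               NOT NULL,
    Date       DATE              NOT NULL,
    Value      DECIMAL(18, 6)    NULL,   -- type is an assumption
    CONSTRAINT PK_stage_LastTradePrice PRIMARY KEY CLUSTERED (ID)
);

-- Main table: clustered on (GSAQuoteID, Date, FromSourceID), which is what lets
-- the OUTER APPLY in the query above do a single-row seek per (GSAQuoteID, Date)
-- segment, taking the row with the highest FromSourceID first.
CREATE TABLE dbo.LastTradePrice
(
    GSAQuoteID   INT            NOT NULL,
    Date         DATE           NOT NULL,
    FromSourceID INT            NOT NULL,
    ToSourceID   INT            NULL,
    Value        DECIMAL(18, 6) NULL,    -- type is an assumption
    CONSTRAINT PK_dbo_LastTradePrice
        PRIMARY KEY CLUSTERED (GSAQuoteID, Date, FromSourceID)
);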

The database itself is about 34GB in size and should therefore fit almost entirely into memory. This is confirmed by the very high buffer cache hit ratio of >= 99.8% that I see consistently.
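
For reference, that ratio can be read from sys.dm_os_performance_counters; the raw counter has to be divided by its base counter to give a percentage:

-- Buffer cache hit ratio = 100 * counter value / base counter value.
SELECT 100.0 * a.cntr_value / b.cntr_value AS buffer_cache_hit_ratio
FROM sys.dm_os_performance_counters a
JOIN sys.dm_os_performance_counters b
  ON b.object_name = a.object_name
 AND b.counter_name = 'Buffer cache hit ratio base'
WHERE a.object_name LIKE '%Buffer Manager%'
  AND a.counter_name = 'Buffer cache hit ratio';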

This all leaves me a bit confused. If these slow statements aren't starved of CPU time, waiting on IO, waiting on a lock or starved of memory bandwidth, what actually is left as a possible cause of the slowdown?

The only thing that I think might be a bit unusual about the system is that there is a relatively large number of tables and stored procedures. For various reasons, each of the ~330 fields I wish to store (e.g. LastTradePrice, Volume, OpenPrice and many more) is given its own table and staging table, each with a corresponding stored procedure that implements the merge. (All of the sprocs are generated from a template, so the merge logic is basically identical for all fields.)

I think the large number of sprocs may explain why I'm seeing what I believe is a relatively high "Stolen pages" count of 200,000 and a plan cache hit ratio of only ~80%, even though no ad-hoc SQL is being used. However, I don't see how this could explain my problem.
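
For reference, both of those figures can be checked from the same DMV (a sketch: on 2008 R2, "Stolen pages" is a Buffer Manager counter, and the plan cache hit ratio must be divided by its base counter):

-- "Stolen pages": buffer pool pages taken for uses other than the data cache
-- (plan cache, sort/hash workspace, etc.).
SELECT cntr_value AS stolen_pages
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Stolen pages';

-- Overall plan cache hit ratio = 100 * counter / base, for the '_Total' instance.
SELECT 100.0 * a.cntr_value / NULLIF(b.cntr_value, 0) AS plan_cache_hit_ratio
FROM sys.dm_os_performance_counters a
JOIN sys.dm_os_performance_counters b
  ON b.object_name = a.object_name
 AND b.instance_name = a.instance_name
 AND b.counter_name = 'Cache hit ratio base'
WHERE a.object_name LIKE '%Plan Cache%'
  AND a.counter_name = 'Cache hit ratio'
  AND a.instance_name = '_Total';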

Best Answer

This is most likely caused by the MERGE operation in SQL Server being single threaded. No matter how many resources you throw at it, it won't run faster, which is exactly what you are observing.

The solution is to manually parallelise the query by running multiple copies of the MERGE statement at the same time, each operating on its own subset of the data.
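
Applied to the first-stage statement from the question, the pattern looks roughly like this (a sketch: @PartitionId, @PartitionCount and the modulo split on GSAQuoteID are illustrative, and the @Stage column types are assumptions; each concurrent session runs the same batch with a different @PartitionId between 0 and @PartitionCount - 1):

DECLARE @PartitionId INT = 0;     -- differs per session (illustrative)
DECLARE @PartitionCount INT = 8;  -- e.g. one slice per core (illustrative)

DECLARE @Stage TABLE (GSAQuoteID INT, Date DATE, Value DECIMAL(18, 6));  -- types assumed

;WITH Source AS
(
    SELECT
        RowNumber =
            ROW_NUMBER() OVER (PARTITION BY GSAQuoteID, Date ORDER BY ID DESC)
        , Source.*
    FROM
        stage.[LastTradePrice] Source
    WHERE
        -- The only change: each session processes a disjoint slice of the
        -- leading clustered key column, so the copies don't contend.
        Source.GSAQuoteID % @PartitionCount = @PartitionId
)
INSERT INTO @Stage (GSAQuoteID, Date, Value)
SELECT Source.GSAQuoteID, Source.Date, Source.Value
FROM Source
OUTER APPLY
(
    SELECT TOP 1
        Value = CASE WHEN ToSourceID IS NULL THEN Target.Value ELSE NULL END
    FROM dbo.[LastTradePrice] Target
    WHERE Target.GSAQuoteID = Source.GSAQuoteID
      AND Target.Date = Source.Date
    ORDER BY Target.FromSourceID DESC
) Target
WHERE Source.RowNumber = 1
  AND (   (Source.Value IS NULL AND Target.Value IS NOT NULL)
       OR (Source.Value IS NOT NULL AND Target.Value IS NULL)
       OR (Source.Value <> Target.Value));

The same filter would be carried through the remaining stages of the merge so that each session only ever touches its own slice of the clustered key range.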

This problem is very common in data warehouses running SQL Server. There is a series of good design patterns in this document: http://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx (DISCLAIMER: This is written by yours truly, so I am shamelessly plugging it.)