SQL Server – Long Running Query on Read-Only Replica vs Primary

availability-groups, performance, query-performance, sql-server, sql-server-2017

I've got a 4-node AG set up as follows:

VM Hardware Configuration of all nodes:

  • Microsoft SQL Server 2017 Enterprise Edition (RTM-CU14) (KB4484710)
  • 16 vCPUs
  • 356 GB RAM (long story to this one…)
  • max degree of parallelism: 1 (as required by app vendor)
  • cost threshold for parallelism: 50
  • max server memory (MB): 338944 (331 GB)

AG Configuration:

  • Node 1: Primary or Synchronous Commit Non-readable Secondary, Configured for Automatic Failover
  • Node 2: Primary or Synchronous Commit Non-readable Secondary, Configured for Automatic Failover
  • Node 3: Readable Secondary set with Asynchronous Commit, Configured for Manual Failover
  • Node 4: Readable Secondary set with Asynchronous Commit, Configured for Manual Failover

The Query In Question:

There's nothing terribly crazy about this query; it provides a summary of outstanding work items in the various queues within the application. You can see the code in one of the execution plan links below.

Execution Behavior on the Primary Node:

When executed on the Primary node, the execution time is generally around the 1 second mark. Here is the execution plan, and below are stats captured from STATISTICS IO and STATISTICS TIME from the primary node:

(347 rows affected)
Table 'Worktable'. Scan count 647, logical reads 2491, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'workitemlc'. Scan count 300, logical reads 7125, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulertask'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'wfschedulertask'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulerservice'. Scan count 1, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulerworkerpool'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'itemlc'. Scan count 1, logical reads 26372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 656 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Execution Behavior on the Read-Only Secondary Node:

When executed on either Read-Only Secondary node (i.e. Node 3 or Node 4), this query uses the same execution plan (this is a different plan link), and the execution stats are roughly the same (e.g. there may be a few more page scans, as these results are always changing). With the exception of CPU time, they look very similar. Here are stats captured from STATISTICS IO and STATISTICS TIME on the read-only secondary node:

(347 rows affected)
Table 'Worktable'. Scan count 647, logical reads 2491, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'workitemlc'. Scan count 300, logical reads 7125, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulertask'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'wfschedulertask'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulerservice'. Scan count 1, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulerworkerpool'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'itemlc'. Scan count 1, logical reads 26372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 55719 ms,  elapsed time = 56335 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Other Details:

I've also run both sp_WhoIsActive and Paul Randal's WaitingTasks.sql script on the secondary while this query is executing, but I don't see any waits occurring whatsoever, which is frankly frustrating:

(screenshot: sp_WhoIsActive / WaitingTasks.sql output showing no waits)
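For anyone who wants to repeat the check without those scripts, a minimal version of the same wait inspection looks like this (a sketch; the `session_id > 50` filter is a rough heuristic for excluding system sessions, not an exact rule):

```sql
-- Minimal wait check: run on the secondary while the slow query executes.
-- If the query were waiting on anything, it would show up here.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id > 50          -- user sessions only (heuristic)
ORDER BY wt.wait_duration_ms DESC;
```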

This also doesn't look to be a case of AG latency as the Synchronization status is actually quite good:

--https://sqlperformance.com/2015/08/monitoring/availability-group-replica-sync

SELECT 
       ar.replica_server_name, 
       adc.database_name, 
       ag.name AS ag_name, 
       drs.is_local, 
       drs.synchronization_state_desc, 
       drs.synchronization_health_desc, 
       --drs.last_hardened_lsn, 
       --drs.last_hardened_time, 
       drs.last_redone_time, 
       drs.redo_queue_size, 
       drs.redo_rate, 
       (drs.redo_queue_size / NULLIF(drs.redo_rate, 0)) / 60.0 AS est_redo_completion_time_min,
       drs.last_commit_lsn, 
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
INNER JOIN sys.availability_databases_cluster AS adc 
       ON drs.group_id = adc.group_id AND 
       drs.group_database_id = adc.group_database_id
INNER JOIN sys.availability_groups AS ag
       ON ag.group_id = drs.group_id
INNER JOIN sys.availability_replicas AS ar 
       ON drs.group_id = ar.group_id AND 
       drs.replica_id = ar.replica_id
ORDER BY 
       ag.name, 
       ar.replica_server_name, 
       adc.database_name;

(screenshot: replica synchronization status results, showing healthy synchronization)

This query seems to be the worst offender. Other queries that also take sub-second times on the Primary node may take 1–5 seconds on the Secondary node, and while that behavior is not as severe, it does look to be causing issues.

Finally, I have also looked at the servers and checked for external processes such as A/V Scans, external jobs generating unexpected I/O, etc. and have come up empty handed. I don't think this is being caused by anything outside of the SQL Server process.

The Question:

It's only noon where I'm at and it's already been a long day, so I suspect I'm missing something obvious here. Either that or we've got something misconfigured, which is possible as we've had a number of calls into the Vendor and MS related to this environment.

For all of my investigation, I just can't seem to find what is causing this difference in performance. I would expect to see some sort of wait occurring on the secondary nodes, but nothing. How can I further troubleshoot this to identify the root cause? Has anyone seen this behavior before and found a way to resolve it?

UPDATE #1
After switching the third node (one of the Read-Only replicas) to non-readable and then back to readable as a test, that replica is still being held up by an open transaction, with any client queries showing the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING wait.
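The sessions stuck on that wait can be listed directly from the request DMV on the secondary (a sketch; column list trimmed to the essentials):

```sql
-- Client sessions waiting for the readable secondary to reach a
-- consistent versioning state after the readable/non-readable flip
SELECT r.session_id,
       r.command,
       r.wait_type,
       r.wait_time AS wait_time_ms
FROM sys.dm_exec_requests AS r
WHERE r.wait_type = N'HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING';
```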

Running a DBCC OPENTRAN command yields the following results:

Oldest active transaction:
    SPID (server process ID): 420
    UID (user ID) : -1
    Name          : QDS nested transaction
    LSN           : (941189:33148:8)
    Start time    : May  7 2019 12:54:06:753PM
    SID           : 0x0
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

When looking up this SPID in sp_who2, it shows it as a BACKGROUND process with QUERY STORE BACK listed as the command.
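The same background transaction can also be inspected without sp_who2 by joining the transaction DMVs; this is a sketch filtering on the QDS transaction names reported by DBCC OPENTRAN:

```sql
-- Find open Query Store transactions and how long they have been open;
-- background transactions may have no associated session, hence LEFT JOIN
SELECT at.transaction_id,
       at.name,                    -- e.g. 'QDS nested transaction'
       at.transaction_begin_time,
       st.session_id
FROM sys.dm_tran_active_transactions AS at
LEFT JOIN sys.dm_tran_session_transactions AS st
       ON st.transaction_id = at.transaction_id
WHERE at.name LIKE N'QDS%'
ORDER BY at.transaction_begin_time;
```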

While we are able to take TLog backups, I suspect we are running into behavior similar to this resolved bug, so I plan on opening a ticket with MS about this particular issue today.

Depending on the outcome of that ticket, I will try to capture a call stack trace per Joe's suggestion and see where we go.

Final Update (Issue Self-Resolved)

After eclipsing the 52-hour mark of the Query Store transaction being open (as identified above), the AG decided to automatically fail over. Before this happened, I did pull some additional metrics. Per this link, provided by Sean, the database in question had a very large version store dedicated to it; at one point I recorded 1651360 pages in the reserved_page_count field and 13210880 in the reserved_space_kb field (roughly 12.6 GB).
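For reference, numbers like those above can be pulled from the version store space DMV with a query along these lines (a sketch; sys.dm_tran_version_store_space_usage is available from SQL Server 2016 SP2 / 2017 onward):

```sql
-- Version store space consumed per database in TempDB.
-- Sanity check: 1 page = 8 KB, so reserved_space_kb should equal
-- reserved_page_count * 8 (e.g. 1651360 * 8 = 13210880).
SELECT DB_NAME(vss.database_id) AS database_name,
       vss.reserved_page_count,
       vss.reserved_space_kb
FROM sys.dm_tran_version_store_space_usage AS vss
ORDER BY vss.reserved_space_kb DESC;
```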

Per the ERRORLOGs, the failover occurred after a 5-minute deluge of transaction-hardening failures related to the QDS base transaction and QDS nested transaction transactions.

The failover did cause an outage of about 10 minutes in my case. The database is ~6 TB in size and very active, so that was actually pretty good in my opinion. While the new primary node was online during this time, no client queries could complete, as they were all waiting on the QDS_LOADDB wait type.

After the failover, the version store numbers dropped to 176 for reserved_page_count and 1408 for reserved_space_kb. Queries against the Secondary Read-Only Replicas also began executing as quickly as if they were run from the primary, so the behavior disappeared entirely as a result of the failover.

Best Answer

This answer is in addition to Joe's answer as I can't be 100% certain it is the version store, however there is enough evidence so far to imply that to be part of the issue.

When a secondary replica is marked as readable, a good steady state for versioning information must first be attained, so that there is a known good starting point for all read operations on the secondary. While this transition is pending and there are still open transactions on the primary, it manifests as HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING. It is also a good indicator that the primary goes through quite a bit of data churn (or at least that someone has a really long open transaction, which also isn't good). The longer transactions stay open and the more data changes, the more versioning will occur.

Secondary replicas achieve readable status by using snapshot isolation under the covers for the session, even though if you check your session information it will show up as the default read committed. Since snapshot isolation is optimistic and uses the version store, all changes need to be versioned. This is exacerbated when there are many running (and potentially long-running) queries on the secondary while the churn of data is high on the primary. Generally this manifests in only a few tables for an OLTP system, but it's completely application- and workload-dependent.
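You can see the reported-vs-actual mismatch for yourself from a session connected to the readable secondary; the session DMV will still report read committed (level 2) even though reads are versioned under the covers:

```sql
-- Reported isolation level for the current session on a readable
-- secondary; expect 'ReadCommitted' even though row versioning is
-- actually in effect for the reads
SELECT s.session_id,
       CASE s.transaction_isolation_level
            WHEN 0 THEN 'Unspecified'
            WHEN 1 THEN 'ReadUncommitted'
            WHEN 2 THEN 'ReadCommitted'
            WHEN 3 THEN 'RepeatableRead'
            WHEN 4 THEN 'Serializable'
            WHEN 5 THEN 'Snapshot'
       END AS isolation_level
FROM sys.dm_exec_sessions AS s
WHERE s.session_id = @@SPID;
```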

The version store itself is measured in generations. When a query runs that requires the version store, the versioning record pointer is used to point to the TempDB chain for that row. I say chain because it's a list of versions for that row, and the entire chain must be walked sequentially to find the proper version based on the starting timestamp of the transaction, so that the results are in line with the data at that given time.

If the version store has many generations for these rows due to long-running transactions on the primary and secondary replicas, queries will take longer than average to run, generally in the form of higher CPU, while all other items seemingly stay exactly the same, such as the execution plan, statistics, and rows returned. Walking the chain is almost a purely CPU-bound operation, so when the chains become very long and the number of rows returned is high, you get a (not linear, but it can be close) increase in query time.

The only thing that can be done is to limit the length of transactions on both the primary and the secondary to make sure the version store isn't becoming too large in TempDB while accumulating many generations. Attempts to clean up the version store happen roughly once a minute; however, cleanup requires that all versions from the same generation no longer be needed before that generation can be removed, and newer versions can't be cleaned until the oldest version is no longer needed. Thus a single long-running query can prevent effective cleanup of many otherwise-unused generations.
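To find the snapshot transactions that are pinning version-store generations, the active-snapshot DMV can be queried on the replica (a sketch; the oldest entries at the top are the cleanup blockers):

```sql
-- Longest-running snapshot transactions; these hold the oldest
-- version-store generation alive and block cleanup of newer ones
SELECT ast.session_id,
       ast.transaction_id,
       ast.transaction_sequence_num,
       ast.elapsed_time_seconds
FROM sys.dm_tran_active_snapshot_database_transactions AS ast
ORDER BY ast.elapsed_time_seconds DESC;
```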

Switching the replica out of readable mode and back in will also clear out the version store, since the replica is no longer readable in the interim.

There are other items that could also be at play, but this seems the most plausible given the current data and way the replica was reacting.

TempDB Versioning DMVs (not to be confused with ADR versioning).