SQL Server – difference in execution plans on UAT and PROD servers

Tags: performance, query-performance, sql-server-2008-r2

I want to understand why there would be such a huge difference in execution of the same query on UAT (runs in 3 sec) vs PROD (runs in 23 sec).

Both UAT and PROD have exactly the same data and indexes.

QUERY:

set statistics io on;
set statistics time on;

SELECT CONF_NO,
       'DE',
       'Duplicate Email Address ''' + RTRIM(EMAIL_ADDRESS) + ''' in Maintenance',
       CONF_TARGET_NO
FROM   CONF_TARGET ct
WHERE  CONF_NO = 161
       AND LEFT(INTERNET_USER_ID, 6) != 'ICONF-'
       AND ( ( REGISTRATION_TYPE = 'I'
               AND (SELECT COUNT(1)
                    FROM   PORTFOLIO
                    WHERE  EMAIL_ADDRESS = ct.EMAIL_ADDRESS
                           AND DEACTIVATED_YN = 'N') > 1 )
              OR ( REGISTRATION_TYPE = 'K'
                   AND (SELECT COUNT(1)
                        FROM   CAPITAL_MARKET
                        WHERE  EMAIL_ADDRESS = ct.EMAIL_ADDRESS
                               AND DEACTIVATED_YN = 'N') > 1 ) ) 

ON UAT :

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 11 ms, elapsed time = 11 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

(3 row(s) affected)
Table 'Worktable'. Scan count 256, logical reads 1304616, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'PORTFOLIO'. Scan count 1, logical reads 84761, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CAPITAL_MARKET'. Scan count 256, logical reads 9472, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CONF_TARGET'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 2418 ms,  elapsed time = 2442 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.


On PROD :

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

(3 row(s) affected)
Table 'PORTFOLIO'. Scan count 256, logical reads 21698816, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CAPITAL_MARKET'. Scan count 256, logical reads 9472, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CONF_TARGET'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 23937 ms,  elapsed time = 23935 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.


Note that on PROD the plan suggests a missing index, and I have tested that it is beneficial, but that is not the point of this discussion.

I just want to understand:
Why does SQL Server create a worktable on UAT but not on PROD? It creates a table spool on UAT and not on PROD. Also, why are the execution times so different on UAT vs PROD?

Note :

I am running SQL Server 2008 R2 RTM on both servers (and will soon patch to the latest SP).

UAT : Max memory 8 GB. MAXDOP, processor affinity, and max worker threads are all 0.

Logical to Physical Processor Map:
*-------  Physical Processor 0
-*------  Physical Processor 1
--*-----  Physical Processor 2
---*----  Physical Processor 3
----*---  Physical Processor 4
-----*--  Physical Processor 5
------*-  Physical Processor 6
-------*  Physical Processor 7

Logical Processor to Socket Map:
****----  Socket 0
----****  Socket 1

Logical Processor to NUMA Node Map:
********  NUMA Node 0

PROD : Max memory 60 GB. MAXDOP, processor affinity, and max worker threads are all 0.

Logical to Physical Processor Map:
**--------------  Physical Processor 0 (Hyperthreaded)
--**------------  Physical Processor 1 (Hyperthreaded)
----**----------  Physical Processor 2 (Hyperthreaded)
------**--------  Physical Processor 3 (Hyperthreaded)
--------**------  Physical Processor 4 (Hyperthreaded)
----------**----  Physical Processor 5 (Hyperthreaded)
------------**--  Physical Processor 6 (Hyperthreaded)
--------------**  Physical Processor 7 (Hyperthreaded)

Logical Processor to Socket Map:
********--------  Socket 0
--------********  Socket 1

Logical Processor to NUMA Node Map:
********--------  NUMA Node 0
--------********  NUMA Node 1

UPDATE :

UAT Execution Plan XML :

http://pastebin.com/z0PWvw8m

PROD Execution Plan XML :

http://pastebin.com/GWTY16YY

UAT Execution Plan XML – with plan generated from PROD:

http://pastebin.com/74u3Ntr0

Server Configuration :

PROD: PowerEdge R720xd – Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz.

UAT: PowerEdge 2950 – Intel(R) Xeon(R) CPU X5460 @ 3.16GHz

I have also posted this at answers.sqlperformance.com.


UPDATE :

Thanks to @swasheck for the suggestion.

After changing max memory on PROD from 60 GB to 7680 MB, I am able to generate the same plan on PROD, and the query completes in the same time as on UAT.

Now I need to understand – why? Also, given this, I won't be able to justify this monster server as a replacement for the old one!
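For reference, the memory change described above can be made with sp_configure; a minimal sketch, assuming sysadmin permissions and the values quoted in this update:

```sql
-- Allow changing advanced options
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Lower PROD max server memory from 60 GB to 7680 MB (the value that
-- reproduced the UAT plan in the test above)
EXEC sp_configure 'max server memory (MB)', 7680;
RECONFIGURE;
```

The change takes effect without a restart, though the buffer pool may take time to shrink.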

Best Answer

The potential size of the buffer pool affects plan selection by the query optimizer in a number of ways. As far as I know, hyper-threading does not affect plan choice (though the number of potentially available schedulers certainly can).

Workspace Memory

For plans that contain memory-consuming iterators like sorts and hashes, the size of the buffer pool (among other things) determines the maximum amount of memory grant that might be available to the query at runtime.

In SQL Server 2012 (all versions) this number is reported on the root node of a query plan, in the Optimizer Hardware Dependencies section, shown as Estimated Available Memory Grant. Versions prior to 2012 do not report this number in show plan.
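Although pre-2012 showplan does not expose these numbers, the plan XML itself can still be pulled from cache for inspection with standard DMVs; a sketch (the LIKE filter is illustrative, matching a table name from the question):

```sql
-- Retrieve cached plan XML for inspection. On SQL Server 2012+, the root
-- node's OptimizerHardwareDependentProperties element carries
-- EstimatedAvailableMemoryGrant (and EstimatedPagesCached).
SELECT TOP (10)
       st.text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)   AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%CONF_TARGET%';  -- illustrative filter
```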

The estimated available memory grant is an input to the cost model used by the query optimizer. As a result, a plan alternative that requires a large sorting or hashing operation is more likely to be chosen on a machine with a large buffer pool setting than on a machine with a lower setting. For installations with a very large amount of memory, the cost model can go too far with this sort of thinking - choosing plans with very large sorts or hashes where an alternative strategy would be preferable (KB2413549 - Using large amounts of memory can result in an inefficient plan in SQL Server - TF2335).
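Per that KB article, the documented mitigation is trace flag 2335, which makes the optimizer cost plans as though a small buffer pool were configured. A sketch of both scopes (test carefully before using in production; QUERYTRACEON requires sysadmin on these versions):

```sql
-- Query-level scope: append the hint to the affected query
SELECT CONF_NO, CONF_TARGET_NO
FROM   CONF_TARGET
WHERE  CONF_NO = 161
OPTION (QUERYTRACEON 2335);

-- Instance-wide alternative: startup parameter -T2335, or temporarily:
DBCC TRACEON (2335, -1);
```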

Workspace memory grant is not a factor in your case, but it is something worth knowing about.

Data Access

The potential size of the buffer pool also affects the optimizer's cost model for data access. One of the assumptions made in the model is that every query starts with a cold cache - so the first access to a page is assumed to incur a physical I/O. The model does attempt to account for the chance that repeated access will come from cache, a factor that depends on the potential size of the buffer pool among other things.

The Clustered Index Scans in the query plans shown in the question are one example of repeated access; the scans are rewound (repeated, without a change of correlated parameter) for each iteration of the nested loops semi join. The outer input to the semi join estimates 28.7874 rows, and the query plan properties for these scans shows estimated rewinds at 27.7874 as a result.

Again, in SQL Server 2012 only, the root iterator of the plan shows the number of Estimated Pages Cached in the Optimizer Hardware Dependencies section. This number reports one of the inputs to the costing algorithm that looks to account for the chance of repeated page access coming from cache.

The effect is that an installation with a higher configured maximum buffer pool size will reduce the cost of scans (or seeks) that read the same pages more than once by a greater amount than an installation with a smaller maximum buffer pool size will.

In simple plans, the cost reduction on a rewound scan can be seen by comparing (estimated number of executions) * (estimated CPU + estimated I/O) with the estimated operator cost, which will be lower. The calculation is more complex in the example plans due to the effect of the semi join and union.

Nevertheless, the plans in the question appear to show a case where the choice between repeating the scans and creating a temporary index is quite finely balanced. On the machine with a larger buffer pool, repeating the scans is costed slightly lower than creating the index. On the machine with a smaller buffer pool, the scan cost is reduced by a smaller amount, meaning the index spool plan looks slightly cheaper to the optimizer.

Plan Choices

The optimizer's cost model makes a number of assumptions, and contains a great number of detailed calculations. It is not always (or even usually) possible to follow all the details because not all the numbers we would need are exposed, and the algorithms can change between releases. In particular, the scaling formula applied to take account of the chance of encountering a cached page is not well known.

More to the point in this particular case, the optimizer's plan choices are based on incorrect numbers anyway. The estimated number of rows from the Clustered Index Seek is 28.7874, whereas 256 rows are encountered at runtime - almost an order of magnitude out. We cannot directly see the information the optimizer has about the expected distribution of values within those 28.7874 rows, but it is very likely to be horribly wrong as well.

When estimates are this wrong, plan selection and runtime performance are essentially no better than chance. The plan with the index spool happens to perform better than repeating the scan, but it is quite wrong to think that increasing the size of the buffer pool was the cause of the anomaly.

Where the optimizer has correct information, the chances are much better that it will produce a decent execution plan. An instance with more memory will generally perform better on a workload than another instance with less memory, but there are no guarantees, especially when plan selection is based on incorrect data.

Both instances suggested a missing index in their own way. One reported an explicit missing index, and the other used an index spool with the same characteristics. If the index provides good performance and plan stability, that might be enough. My inclination would be to rewrite the query as well, but that's probably another story.
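To make that concrete: below is a sketch of the sort of index the spool hints at, and one possible rewrite of the question's query using EXISTS with HAVING instead of counting subqueries. Index names and column choices are assumptions inferred from the predicates shown, not a tested recommendation:

```sql
-- Possible index matching the spool / missing-index shape (names assumed);
-- a similar index on CAPITAL_MARKET may also help.
CREATE INDEX IX_PORTFOLIO_EMAIL
    ON PORTFOLIO (EMAIL_ADDRESS)
    INCLUDE (DEACTIVATED_YN);

-- One possible (untested) rewrite: stop counting all duplicates and
-- instead test for "more than one active row" per branch.
SELECT ct.CONF_NO,
       'DE',
       'Duplicate Email Address ''' + RTRIM(ct.EMAIL_ADDRESS) + ''' in Maintenance',
       ct.CONF_TARGET_NO
FROM   CONF_TARGET AS ct
WHERE  ct.CONF_NO = 161
       AND LEFT(ct.INTERNET_USER_ID, 6) != 'ICONF-'
       AND EXISTS
       (
           SELECT 1
           FROM   PORTFOLIO AS p
           WHERE  ct.REGISTRATION_TYPE = 'I'
                  AND p.EMAIL_ADDRESS = ct.EMAIL_ADDRESS
                  AND p.DEACTIVATED_YN = 'N'
           GROUP  BY p.EMAIL_ADDRESS
           HAVING COUNT(*) > 1

           UNION ALL

           SELECT 1
           FROM   CAPITAL_MARKET AS cm
           WHERE  ct.REGISTRATION_TYPE = 'K'
                  AND cm.EMAIL_ADDRESS = ct.EMAIL_ADDRESS
                  AND cm.DEACTIVATED_YN = 'N'
           GROUP  BY cm.EMAIL_ADDRESS
           HAVING COUNT(*) > 1
       );
```

The EXISTS form gives the optimizer the option of a semi join that stops probing once the duplicate condition is satisfied, rather than fully counting matches for every outer row.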