SQL Server – Pre-staging data causes execution plan cost to skyrocket

execution-plan, optimization, performance, query-performance, sql-server, sql-server-2016

I have a troublesome query that we are trying to tune. One of our first thoughts was to take a portion of the larger execution plan, store those results in an intermediate temp table, and then perform the remaining operations against it.

What I'm observing is that, when we pre-stage the data into a temp table, the execution plan cost goes through the roof (22 -> 1.1k). This has the benefit of allowing the plan to go parallel, which reduced execution time by 20%, but that isn't worth the much higher CPU usage per execution in our case.

We are using SQL Server 2016 SP2 with the legacy cardinality estimator (CE) enabled.
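
For reference, the legacy CE can be forced in a few ways on 2016; which mechanism we use isn't material to the question, but for illustration:

-- Illustration only: common ways to force the legacy cardinality estimator on SQL Server 2016.
-- Database-wide, via a database scoped configuration:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Or per query, via a hint (available from 2016 SP1):
-- ... OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));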

Original plan (Cost ~20):

Low cost plan

https://www.brentozar.com/pastetheplan/?id=ry-QGnkCM

Original SQL:

WITH Object1(Column1, Column2, Column3, Column4, Column5, Column6)
AS
(
    SELECT  Object2.Column1, 
            Object2.Column2, 
            Object3.Column3, 
            Object3.Column4, 
            Object3.Column5, 
            Object3.Column6
    FROM Object4 AS Object5
    INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
    INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
    INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
    INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
    WHERE   Object8.Column12 IS NULL AND
            Object8.Column13 = Object5.Column13 AND 
            Object3.Column3 = Object5.Column3 AND 
            Object11.Column14 = Variable1
)
INSERT Object12
SELECT  Object13.Column2,
        Object13.Column3,
        MIN(Object13.Column4) AS Column15,
        MAX(Object13.Column4) AS Column16,
        COUNT(DISTINCT (CASE WHEN Object13.Column5 = 1 THEN Object13.Column1 END)) AS Column17,
        COUNT(DISTINCT (CASE WHEN Object13.Column6 = 0 THEN Object13.Column1 END)) AS Column18,
        COUNT(DISTINCT Object13.Column1) AS Column19
FROM Object1 AS Object13
GROUP BY Object13.Column2, Object13.Column3 OPTION (RECOMPILE) 

New plan (the area highlighted in blue above is pre-staged into a temp table – Cost ~1.1k):

High cost plan

https://www.brentozar.com/pastetheplan/?id=rycqG3JRf

New SQL:

SELECT  Object1.Column1,
        Object1.Column2,
        MIN(Object2.Column3) AS Column4,
        MAX(Object2.Column3) AS Column5,
        COUNT(DISTINCT (CASE WHEN Object2.Column6 = 1 THEN Object1.Column7 END)) AS Column8,
        COUNT(DISTINCT (CASE WHEN Object2.Column9 = 0 THEN Object1.Column7 END)) AS Column10,
        COUNT(DISTINCT Object1.Column7) AS Column11
FROM Object3 Object1
JOIN Object4 Object2 ON Object2.Column12 = Object1.Column13 AND Object2.Column2 = Object1.Column2
WHERE Object2.Column14 = 0
GROUP BY Object1.Column1, Object1.Column2 OPTION (RECOMPILE) 

Can someone help us understand why the new plan has such a dramatically higher cost? I'll be happy to provide additional information regarding the underlying tables/indexes, if needed.

In the case of the original plan, we do realize that it's doing an INSERT instead of a SELECT. Even so, the SELECT shouldn't (in our minds) be that much more costly.

This is the actual execution plan. It's a concern because, due to the immensely higher plan cost, the query goes parallel and therefore uses more CPU. We're also just curious as to why the plan cost goes up that much for something like pre-staging the data, which usually gets you close to, if not better than, the original cost.

The temp table in the second query has a composite clustered PK on Object1.Column13 and Object1.Column2. This matches the columns (and order) of Object4. Adding a MAXDOP hint is an option, but this is also an academic exercise of "why in the world does the cost go up that much?"
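
For illustration, a minimal sketch of what that temp table definition looks like with the anonymized names (the column data types and the # prefix are assumptions; only the key structure matters here):

-- Sketch only: anonymized names from the plan; data types assumed.
CREATE TABLE #Object3
(
    Column13 int NOT NULL,   -- leading key column, matches Object4
    Column2  int NOT NULL,   -- second key column, matches Object4
    Column1  int NOT NULL,
    Column7  int NOT NULL,
    PRIMARY KEY CLUSTERED (Column13, Column2)
);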

Adding OPTION (ORDER GROUP) to the second query results in no change, same operators/costs.

NOTES:

  • Object9 in the first query is the same object as Object4 in the second.

Best Answer

Costs are based on estimates, even in "actual plans". You cannot compare two query plans side by side and conclude that one of them will require more CPU to execute based on operator or total plan costs alone. I can create a query with a cost in the millions that executes in one second. I can also create a query with a tiny cost that will effectively take forever to execute. For your case, the first query has a cost of only 22 optimizer units because of a poor cardinality estimate after the hash join:

[Screenshot: plan operators highlighted in red, blue, and black]

The operators in red execute millions of times, but the query optimizer expects them to only execute a few thousand times. Costs, which are based on estimates, won't reflect that work. The operator in blue is a table spool for which the cardinality estimator expects to insert a single row. It instead inserts a few million. As a result, the operators in black (along with a few others not shown) are inefficient and spill to tempdb.

With the other plan, you put a significant number of rows into tempdb and as a result the cardinality estimate is more reasonable, though it's still not ideal:

[Screenshot: cardinality estimates in the plan that uses the temp table]

The query optimizer expects to process many more rows, and as a result the query plan has a higher cost. As a very general rule of thumb you may see improved performance with improved estimates, but it doesn't always work out the way you'd like. Looking at the plan with the temp table, I see a few areas of possible improvement:

  1. Load the full CTE from the original query into the temp table. A query with multiple distinct aggregates can be tough to optimize. Sometimes you'll get a query plan where all of the data is loaded into a spool (into tempdb) and some of the aggregates are applied separately to the spool. All of that work is always done in a serial zone in my experience. If you eliminate all of the joins in the query I believe that you won't get that optimization. The aggregates will just be applied to the temp table. That will save you the work of writing out nearly the same data to tempdb and the entire plan should be eligible for parallelism.

  2. Define the temp table as a heap and write to it with TABLOCK. It looks like right now you have a clustered index, which means you aren't eligible for parallel insert.

  3. Consider making the query eligible for batch mode by using one of these tricks. Batch mode aggregates can be significantly more efficient with multiple distinct aggregates. A rough sketch combining all three suggestions follows this list.
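
To make those three suggestions concrete, here is a rough sketch using the anonymized names from the question. The column data types are guesses, and the batch mode trick shown (an empty table with a clustered columnstore index joined on an always-false predicate) is just one commonly used approach:

-- Sketch only: anonymized names, data types assumed.
-- 1. Define the temp table as a heap (no clustered index) so the load below is eligible for parallel insert.
CREATE TABLE #Staged
(
    Column1 int NOT NULL,
    Column2 int NOT NULL,
    Column3 int NOT NULL,
    Column4 datetime NOT NULL,
    Column5 int NOT NULL,
    Column6 int NOT NULL
);

-- 2. Load the FULL result of the original CTE, so no joins remain in the aggregate query.
INSERT INTO #Staged WITH (TABLOCK)
SELECT  Object2.Column1,
        Object2.Column2,
        Object3.Column3,
        Object3.Column4,
        Object3.Column5,
        Object3.Column6
FROM Object4 AS Object5
INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
WHERE   Object8.Column12 IS NULL AND
        Object8.Column13 = Object5.Column13 AND
        Object3.Column3 = Object5.Column3 AND
        Object11.Column14 = Variable1;   -- same variable as in the original query

-- 3. Empty table with a clustered columnstore index; joining to it on a false predicate
--    is one common way to make the aggregate eligible for batch mode on 2016.
CREATE TABLE #BatchModeTrick (Dummy int NULL);
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON #BatchModeTrick;

SELECT  s.Column2,
        s.Column3,
        MIN(s.Column4) AS Column15,
        MAX(s.Column4) AS Column16,
        COUNT(DISTINCT (CASE WHEN s.Column5 = 1 THEN s.Column1 END)) AS Column17,
        COUNT(DISTINCT (CASE WHEN s.Column6 = 0 THEN s.Column1 END)) AS Column18,
        COUNT(DISTINCT s.Column1) AS Column19
FROM #Staged AS s
LEFT OUTER JOIN #BatchModeTrick AS bm ON 1 = 0   -- never matches; exists only to enable batch mode
GROUP BY s.Column2, s.Column3
OPTION (RECOMPILE);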

I would expect some combination of those steps to significantly improve runtime. Please note that this was only a quick analysis, partly because anonymized plans are difficult to interpret.