SQL Server – Pre-staging data causes execution plan cost to skyrocket

execution-plan, optimization, performance, query-performance, sql-server, sql-server-2016

I have a troublesome query that we are trying to tune. One of our first thoughts was to take a portion of the larger execution plan, store those results in an intermediate temp table, and then perform the remaining operations against it.

What I'm observing is that, when we pre-stage the data into a temp table, the execution plan cost goes through the roof (22 -> 1.1k). This has the benefit of allowing the plan to go parallel, which reduced execution time by 20%, but that isn't worth the much higher CPU usage per execution in our case.

We are using SQL Server 2016 SP2 with the legacy cardinality estimator (CE) enabled.
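
For reference, the legacy CE can be forced in a few ways on 2016; which mechanism we use isn't material to the question, but for illustration:

-- Illustration only: common ways to force the legacy cardinality estimator on SQL Server 2016.
-- Database-wide, via a database scoped configuration:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Or per query, via a hint (available from 2016 SP1):
-- ... OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));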

Original plan (Cost ~20):

Low cost plan

https://www.brentozar.com/pastetheplan/?id=ry-QGnkCM

Original SQL:

WITH Object1(Column1, Column2, Column3, Column4, Column5, Column6)
AS
(
    SELECT  Object2.Column1, 
            Object2.Column2, 
            Object3.Column3, 
            Object3.Column4, 
            Object3.Column5, 
            Object3.Column6
    FROM Object4 AS Object5
    INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
    INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
    INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
    INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
    WHERE   Object8.Column12 IS NULL AND
            Object8.Column13 = Object5.Column13 AND 
            Object3.Column3 = Object5.Column3 AND 
            Object11.Column14 = Variable1
)
INSERT Object12
SELECT  Object13.Column2,
        Object13.Column3,
        MIN(Object13.Column4) AS Column15,
        MAX(Object13.Column4) AS Column16,
        COUNT(DISTINCT (CASE WHEN Object13.Column5 = 1 THEN Object13.Column1 END)) AS Column17,
        COUNT(DISTINCT (CASE WHEN Object13.Column6 = 0 THEN Object13.Column1 END)) AS Column18,
        COUNT(DISTINCT Object13.Column1) AS Column19
FROM Object1 AS Object13
GROUP BY Object13.Column2, Object13.Column3 OPTION (RECOMPILE) 

New plan (the area highlighted in blue above is pre-staged into a temp table – Cost ~1.1k):

High cost plan

https://www.brentozar.com/pastetheplan/?id=rycqG3JRf

New SQL:

SELECT  Object1.Column1,
        Object1.Column2,
        MIN(Object2.Column3) AS Column4,
        MAX(Object2.Column3) AS Column5,
        COUNT(DISTINCT (CASE WHEN Object2.Column6 = 1 THEN Object1.Column7 END)) AS Column8,
        COUNT(DISTINCT (CASE WHEN Object2.Column9 = 0 THEN Object1.Column7 END)) AS Column10,
        COUNT(DISTINCT Object1.Column7) AS Column11
FROM Object3 Object1
JOIN Object4 Object2 ON Object2.Column12 = Object1.Column13 AND Object2.Column2 = Object1.Column2
WHERE Object2.Column14 = 0
GROUP BY Object1.Column1, Object1.Column2 OPTION (RECOMPILE) 

Can someone help us understand why the new plan has such a dramatically higher cost? I'll be happy to provide additional information regarding the underlying tables/indexes, if needed.

In the case of the original plan, we do realize that it's doing an INSERT instead of a SELECT. Even so, the SELECT shouldn't (in our minds) be that much more costly.

This is the actual execution plan. It's a concern because, due to the immensely higher plan cost, the query goes parallel and therefore uses more CPU. We're also just curious as to why the plan cost goes up that much for something like pre-staging the data, which usually gets you close to, if not better than, the original cost.

The temp table in the second query has a composite clustered PK on Object1.Column13 and Object1.Column2. This matches the columns (and order) of Object4. Adding a MAXDOP hint is an option, but this is also an academic exercise of "why in the world does the cost go up that much?"
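
For illustration, a minimal sketch of what that temp table definition looks like with the anonymized names (the column data types and the # prefix are assumptions; only the key structure matters here):

-- Sketch only: anonymized names from the plan; data types assumed.
CREATE TABLE #Object3
(
    Column13 int NOT NULL,   -- leading key column, matches Object4
    Column2  int NOT NULL,   -- second key column, matches Object4
    Column1  int NOT NULL,
    Column7  int NOT NULL,
    PRIMARY KEY CLUSTERED (Column13, Column2)
);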

Adding OPTION (ORDER GROUP) to the second query results in no change, same operators/costs.

NOTES:

  • Object9 in the first query is the same object as Object4 in the second.

Best Answer

Costs are based on estimates, even in "actual plans". You cannot compare two query plans side by side and conclude that one of them will require more CPU to execute based on operator or total plan costs alone. I can create a query with a cost in the millions that executes in one second. I can also create a query with a tiny cost that will effectively take forever to execute. For your case, the first query has a cost of only 22 optimizer units because of a poor cardinality estimate after the hash join:

[Screenshot: plan operators highlighted in red, blue, and black]

The operators in red execute millions of times, but the query optimizer expects them to only execute a few thousand times. Costs, which are based on estimates, won't reflect that work. The operator in blue is a table spool for which the cardinality estimator expects to insert a single row. It instead inserts a few million. As a result, the operators in black (along with a few others not shown) are inefficient and spill to tempdb.

With the other plan, you put a significant number of rows into tempdb and as a result the cardinality estimate is more reasonable, though it's still not ideal:

[Screenshot: cardinality estimates in the plan that uses the temp table]

The query optimizer expects to process many more rows, and as a result the query plan has a higher cost. As a very general rule of thumb you may see improved performance with improved estimates, but it doesn't always work out the way you'd like. Looking at the plan with the temp table, I see a few areas of possible improvement:

  1. Load the full CTE from the original query into the temp table. A query with multiple distinct aggregates can be tough to optimize. Sometimes you'll get a query plan where all of the data is loaded into a spool (into tempdb) and some of the aggregates are applied separately to the spool. All of that work is always done in a serial zone in my experience. If you eliminate all of the joins in the query I believe that you won't get that optimization. The aggregates will just be applied to the temp table. That will save you the work of writing out nearly the same data to tempdb and the entire plan should be eligible for parallelism.

  2. Define the temp table as a heap and write to it with TABLOCK. It looks like right now you have a clustered index, which means you aren't eligible for parallel insert.

  3. Consider making the query eligible for batch mode by using one of these tricks. Batch mode aggregates can be significantly more efficient with multiple distinct aggregates. A rough sketch combining all three suggestions follows this list.
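
To make those three suggestions concrete, here is a rough sketch using the anonymized names from the question. The column data types are guesses, and the batch mode trick shown (an empty table with a clustered columnstore index joined on an always-false predicate) is just one commonly used approach:

-- Sketch only: anonymized names, data types assumed.
-- 1. Define the temp table as a heap (no clustered index) so the load below is eligible for parallel insert.
CREATE TABLE #Staged
(
    Column1 int NOT NULL,
    Column2 int NOT NULL,
    Column3 int NOT NULL,
    Column4 datetime NOT NULL,
    Column5 int NOT NULL,
    Column6 int NOT NULL
);

-- 2. Load the FULL result of the original CTE, so no joins remain in the aggregate query.
INSERT INTO #Staged WITH (TABLOCK)
SELECT  Object2.Column1,
        Object2.Column2,
        Object3.Column3,
        Object3.Column4,
        Object3.Column5,
        Object3.Column6
FROM Object4 AS Object5
INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
WHERE   Object8.Column12 IS NULL AND
        Object8.Column13 = Object5.Column13 AND
        Object3.Column3 = Object5.Column3 AND
        Object11.Column14 = Variable1;   -- same variable as in the original query

-- 3. Empty table with a clustered columnstore index; joining to it on a false predicate
--    is one common way to make the aggregate eligible for batch mode on 2016.
CREATE TABLE #BatchModeTrick (Dummy int NULL);
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON #BatchModeTrick;

SELECT  s.Column2,
        s.Column3,
        MIN(s.Column4) AS Column15,
        MAX(s.Column4) AS Column16,
        COUNT(DISTINCT (CASE WHEN s.Column5 = 1 THEN s.Column1 END)) AS Column17,
        COUNT(DISTINCT (CASE WHEN s.Column6 = 0 THEN s.Column1 END)) AS Column18,
        COUNT(DISTINCT s.Column1) AS Column19
FROM #Staged AS s
LEFT OUTER JOIN #BatchModeTrick AS bm ON 1 = 0   -- never matches; exists only to enable batch mode
GROUP BY s.Column2, s.Column3
OPTION (RECOMPILE);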

I would expect some combination of those steps to significantly improve runtime. Please note that this was only a quick analysis, partly because anonymized plans are difficult to interpret.