SQL Server – Bad row estimate following Compute Scalar operator in plan

cardinality-estimates, execution-plan, optimization, sql-server, sql-server-2014

I'm struggling to understand where a row estimate is coming from in an execution plan.

Paste the plan link

DECLARE
    @BatchKey INT = 1,
    @ParentBatchKey INT = 1,
    @QuoteRef VARCHAR(50) = 'Q00018249',
    @MpanRef VARCHAR(50) = '1425431100004'


SELECT DISTINCT
        ISNULL(c.ContractReference,-1) AS [ContractReference] ,
        ISNULL(d_cd.ContractDetailsKey,-1) AS [ContractDetailsKey] ,
        -1 AccountManagerKey,
        -1 SegmentationKey,
        ISNULL(d_tpi.TpiKey,-1) AS [TpiKey] ,
        ISNULL(d_cu.CustomerKey,-1) AS [CustomerKey] ,
        ISNULL(d_p.ProductKey,-1) AS [ProductKey] ,
        -1 as PayPointKey,
        -1 AS [GspBandingKey], --Not used in Junifer ESOB
        ISNULL(d_pps.[ProductPricingStructureKey],-1) AS [ProductPricingStructureKey],
        ISNULL(d_tou.TouBandingKey,-1) AS [PricingStructureBandingKey],
        -1 AS [VolumePointCategoryKey],
        ISNULL(d_ppc.PowerPeriodCategoryKey,-1) AS [PowerPeriodCategoryKey],
        ISNULL(d_pcat.[PriceComponentAggregationTypeKey],-1) AS [PriceComponentAggregationTypeKey],
        -1 AS [MarginRateBandingKey], --Not used in Junifer ESOB
        -1 AS [DuosUrcBandingKey], --Not used in Junifer ESOB
        -1 AS [ConsumptionToleranceKey],
        ISNULL(d_mp.MeterPointKey,-1) AS [MeterPointKey] ,
        ISNULL(d.DateKey,-1) AS [ForecastDateKey] ,
        -1 AS [ForecastEFADateKey], 
        ISNULL(d_cw.DateKey,-1) AS [ContractWonDateKey] ,
        ISNULL(f.SiteVolumeKwh,0) AS [SiteVolume] ,
        ISNULL(f.GspVolumeKwh,0) AS [GspVolume] ,
        ISNULL(f.NbpVolumeKwh,0) AS [NbpVolume],
        @BatchKey,
        @ParentBatchKey,
        CAST(f.ForecastKey as NVARCHAR(100)) AS [SourceId]
FROM 
        [Electricity].[Forecast] f
        INNER JOIN Electricity.ContractMeterPoint cmp ON cmp.MeterPointKey = f.MeterPointKey AND cmp.ContractKey = f.ContractKey
        INNER JOIN Electricity.Contract c ON c.ContractKey = cmp.ContractKey
        INNER JOIN Electricity.MeterPoint mp ON mp.MeterPointKey = cmp.MeterPointKey

        --INNER JOIN Electricity.ContractMeterPoint cmp ON cmp.MeterPointKey = mp.MeterPointKey and cmp.ContractKey = c.ContractKey 
        INNER JOIN Electricity.ProductBundle pb ON c.ProductBundleKey = pb.ProductBundleKey
        LEFT JOIN Electricity.Quote q ON c.QuoteKey = q.QuoteKey
        LEFT JOIN Gdf.Tender t ON q.TenderKey = t.TenderKey
        LEFT JOIN Gdf.Customer cu ON q.CustomerKey = cu.CustomerKey
        LEFT JOIN Electricity.ProductBundleAggregationType pbat ON pbat.ProductName = pb.BundleName
        LEFT JOIN Dimensional_DW.DimensionElectricity.Product d_p ON d_p.ProductDurableKey = pb.ProductBundleKey
        LEFT JOIN Dimensional_DW.Dimension.Tpi d_tpi ON d_tpi.TpiDurableKey = c.TpiKey
        LEFT JOIN Dimensional_DW.DimensionElectricity.ProductPricingStructure d_pps ON d_pps.ProductPricingStructureDurableKey = f.PriceStructureKey
        LEFT JOIN Dimensional_DW.DimensionElectricity.TouBanding d_tou ON d_tou.TouBandingDurableKey = f.PriceRateKey
        LEFT JOIN Dimensional_DW.DimensionElectricity.MeterPoint d_mp ON d_mp.MeterPointDurableKey = cmp.MeterPointKey
        LEFT JOIN Dimensional_DW.DimensionElectricity.PriceComponentAggregationType d_pcat
            ON d_pcat.[TnuosAggregationType] = pbat.[TNUoSAggType]
            AND d_pcat.[DuosAggregationType] = pbat.[DUoSFixedAvailAggType]
            AND d_pcat.[DuosUrcAggregationType] = pbat.[DUoSURCAggType]
            AND d_pcat.[BsuosAggregationType] = pbat.[BSUoSAggType]
            AND d_pcat.[ROAggregationType] = pbat.[ROAggType]
        LEFT JOIN Dimensional_DW.Dimension.Date AS d ON d.DateKey = CAST(CONVERT(NVARCHAR(8), f.DeliveryDate, 112) AS INT) 
        LEFT JOIN Dimensional_DW.Dimension.Date AS d_cw ON d_cw.DateKey = CAST(CONVERT(NVARCHAR(8), c.QuoteWonDate, 112) AS INT) 
        LEFT JOIN Dimensional_DW.DimensionElectricity.PowerPeriodCategory d_ppc ON d_ppc.HhPeriod = f.Period
        LEFT JOIN Dimensional_DW.Dimension.Customer d_cu ON d_cu.CustomerDurableKey = cu.CustomerKey
        LEFT JOIN Dimensional_DW.DimensionElectricity.ContractDetails d_cd ON d_cd.ContractDetailsDurableKey = c.ContractKey

WHERE   1=1
        AND c.ContractReference = @QuoteRef
        AND c.QuoteWonDate IS NOT NULL
        AND c.QuoteKey <> -1
        --(SELECT distinct C.ContractKey FROM Electricity.Contract WHERE ContractReference = @QuoteRef and c.QuoteWonDate IS NOT NULL and c.QuoteKey <> -1)
        --(SELECT distinct C1.ContractKey FROM Electricity.Contract c1 WHERE c1.ContractReference = @QuoteRef and c1.QuoteWonDate IS NOT NULL and c1.QuoteKey <> -1)
        AND mp.MpanID = @MpanRef
        --and c.ContractKey = 18235
        --and d.DateKey =  20180522
ORDER BY [ForecastDateKey]

My problem is around node 26, the Compute Scalar operator:

[Execution plan screenshot showing the Compute Scalar operator at node 26]

I'm unsure how the row estimate of 5 is being generated. That estimate then seems to cascade down the plan to most of the other operators: the Nested Loops operators further down the plan all show an estimated number of executions of roughly 5 against an actual of roughly 35,000.

Why would the Compute Scalar be fed an estimate of ~14,000 rows, yet estimate an output of 5? Is this a problem or a red herring? Does it have anything to do with the conversions it is performing? I can understand that affecting a join, but why would it affect the output of the conversion?

Best Answer

Why would the Compute Scalar be fed an estimate of ~14,000 rows, yet estimate an output of 5? Is this a problem or a red herring?

This is counter-intuitive, but it is a natural consequence of the way the query optimizer explores the plan space. As it generates new, logically equivalent alternatives for a particular plan operator or subtree, it may need to derive a new cardinality estimate.

Since estimation is a statistical process, there is no guarantee that estimates derived on logically equivalent (but physically different) trees will produce the same number; in fact, in the majority of cases they won't. There is normally no obvious way to prefer one estimate over another.

When optimization reaches its end point, the best physical alternatives found are 'stitched together' to form the final plan. This plan can have 'inconsistencies' as a result, simply because estimates were computed on different logical structures at different times. For example, a Compute Scalar might have started out as a logical aggregate, which was later simplified.

I wrote more about this in my article Indexed Views and Statistics.

If you suspect the cardinality mis-estimate is affecting plan choice (in an important way), you may choose to split the query up manually or use hints. Materializing the small intermediate set at or around node 27 into a temporary table may well improve plan quality, since the optimizer can see accurate cardinality at that point and create automatic statistics. The query writer can also choose to add indexing to the temporary table.
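
A minimal sketch of that approach, reusing the table and column names (and the variables from the DECLARE block) in the question, might look like the following; the exact split point would need to match what the real plan does around node 27:

SELECT  c.ContractKey,
        c.ContractReference,
        c.QuoteKey,
        c.QuoteWonDate,
        cmp.MeterPointKey           -- plus any other Contract/MeterPoint columns the outer query needs
INTO    #FilteredContractMeterPoint
FROM    Electricity.Contract c
        INNER JOIN Electricity.ContractMeterPoint cmp ON cmp.ContractKey = c.ContractKey
        INNER JOIN Electricity.MeterPoint mp ON mp.MeterPointKey = cmp.MeterPointKey
WHERE   c.ContractReference = @QuoteRef
        AND c.QuoteWonDate IS NOT NULL
        AND c.QuoteKey <> -1
        AND mp.MpanID = @MpanRef;

-- Optional: index the temporary table to support the subsequent joins
CREATE CLUSTERED INDEX cx_fcmp ON #FilteredContractMeterPoint (ContractKey, MeterPointKey);

The main SELECT then joins Electricity.Forecast and the dimension tables to #FilteredContractMeterPoint instead of repeating the filters, so the optimizer starts from an accurate row count (with automatic statistics) for that small set.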

Does it have anything to do with the conversions it is performing? I can understand that affecting a join, but why would it affect the output of the conversion?

Not usually, no, though it is best to avoid conversions wherever possible. Conversions can certainly affect cardinality estimation, but there is little indication that they are the cause here.
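
For what it's worth, the yyyymmdd integer in the two Date joins can be derived without the NVARCHAR round trip. A sketch against the question's query follows; CalendarDate below is a hypothetical date-typed column on the Date dimension, if one exists:

-- Same yyyymmdd integer key, derived arithmetically rather than via NVARCHAR
LEFT JOIN Dimensional_DW.Dimension.Date AS d
    ON d.DateKey = YEAR(f.DeliveryDate) * 10000
                 + MONTH(f.DeliveryDate) * 100
                 + DAY(f.DeliveryDate)

-- Or, if the dimension exposes a date-typed column, avoid converting the fact column at all:
-- LEFT JOIN Dimensional_DW.Dimension.Date AS d
--     ON d.CalendarDate = CAST(f.DeliveryDate AS date)

Neither variation is likely to change the estimate at node 26, but it does remove a string conversion from the join predicates.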