Sql-server – Explanation for second Compute Scalar

optimizationsql serversql-server-2019

I noticed a peculiar thing. This query

declare @t int = 3;

select distinct top(@t) 
    Number - Number + ABS(CHECKSUM( CAST(NEWID() as binary(16)) ))
from Numbers as n;

produces this plan

enter image description here

dbo.Numbers is a standard Numbers table – just a contiguous list of integers with a clustered primary key – containing 100k rows.

There are two Compute Scalar operators. The right-hand one computes Number - Number and defines an internal expression. The left-hand one performs the addition and internal function calls. Strangely there is only one Compute Scalar operator if it is rewritten as

Number + ABS(...) - Number

Is there something "interesting" going on here? Or is this a simple artefact of how the parser constructs the execution plan from memos? Both forms complete compilation in search(1). Forcing search(2) using TF8677 does not change the behaviour. TF8606 shows the separation in Tree After Project Normalization.

I don't see this as a problem. I'm just interested to know if there's something to learn here.

Some background

I want to produce a fixed-length list of random numbers. This

ABS(CHECKSUM(NEWID()))

seems to be the way to go. I have a Numbers table (with 100k rows), so using that to drive output seemed reasonable. I wanted sampling without replacement so a DISTINCT would ensure there were no duplicates and a TOP() will give the required resultset size. Putting it together I got

declare @t int = 3;

select distinct top(@t) 
    ABS(CHECKSUM(NEWID()))
from Numbers as n;

Sadly the plan was a clustered index scan with a Hash Match (Aggregate). Reading only the minumum number of rows required would be preferable, perhaps with a Flow Distinct to enforce uniqueness.

That quickly lead me to this question. I tried each of the suggestions there. None changed the plan except converting the base table column to a string, appending a string function's empty output and re-converting to an integer (I didn't try a CCI). This felt unsatisfactory.

Recognising that CHECKSUM returned and integer, and that the Numbers table held integers, I tried this no-op:

declare @t int = 3;

select distinct top(@t) 
    Number - Number + ABS(CHECKSUM( CAST(NEWID() as binary(16)) ))
from Numbers as n;

This gave the hoped-for plan seen above.

The shape is maintained for all parameter values from 1 to 100k.

For what it's worth I'm running

Microsoft SQL Server 2019 (RTM) – 15.0.2000.5 (X64) Sep 24 2019 13:48:23 Copyright (C) 2019 Microsoft Corporation Developer Edition (64-bit) on Windows 10 Home 10.0 (Build 18362: )

Intel Core i7-8550U @ 2GHz, 16GB RAM, SSD

Best Answer

There's nothing particularly interesting about the number of Compute Scalars. The compilation and optimization process often produces more of these than is strictly necessary. Conor Cunningham talked about this in general terms at one of the early SQLBits conferences. In an ideal world, unnecessary scalar computations would be collapsed at the end of optimization, but only a few cases are implemented.

On the background: the choice of Flow Distinct depends on the number of rows needed being less than the number of distinct values estimated.

Using a variable with TOP means a default guess of 100 rows needed, so we need the number of distinct values to be more than that.

How the cardinality estimator (CE) treats ABS(CHECKSUM(NEWID())) depends on version. The CE model introduced in SQL Server 2014 estimates this expression will produce one unique value, regardless of the number of times it is evaluated (number of Numbers table rows read in this case). This is not a very good estimate of course.

The pre-2014 CE model makes a better educated guess, giving an estimate well in excess of 100, and so a Flow Distinct. My Numbers table does have ten million rows though. You can help the equation along if necessary by hinting a low value for @t e.g. OPTIMIZE FOR (@t = 1).

You can hint the earlier CE model in any of the usual ways e.g. USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION') or USE HINT ('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_110').

For example:

DECLARE @T integer = 3;

SELECT DISTINCT TOP (@T)
    ABS(CHECKSUM(NEWID()))
FROM dbo.Numbers AS N
OPTION
(
    OPTIMIZE FOR (@T = 1),
    USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_110')
);

Plan Explorer plan