Sql-server – Where do this Constant Scan and Left Outer Join come from in a trivial SELECT query plan

azure-sql-database, execution-plan, sql-server

I have this table:

CREATE TABLE [dbo].[Accounts] (
    [AccountId] UNIQUEIDENTIFIER UNIQUE NOT NULL DEFAULT NEWID(),
    -- WHATEVER other columns
);
GO
CREATE UNIQUE CLUSTERED INDEX [AccountsIndex]
    ON [dbo].[Accounts]([AccountId] ASC);
GO

This query:

DECLARE @result UNIQUEIDENTIFIER
SELECT @result = AccountId FROM Accounts WHERE AccountId='guid-here'

executes with a query plan consisting of a single Index Seek – as expected:

SELECT <---- Clustered Index Seek

This query does the same:

DECLARE @result UNIQUEIDENTIFIER
SET @result = (SELECT AccountId FROM Accounts WHERE AccountId='guid-here')

but it's executed with a plan where the result of the Index Seek is Left Outer Joined with the result of a Constant Scan and then fed into a Compute Scalar:

SELECT <--- Compute Scalar <--- Left Outer Join <--- Constant Scan
                                      ^
                                      |------Clustered Index Seek

What's that extra magic? What does that Constant Scan followed by Left Outer Join do?

Best Answer

The semantics of the two statements are different:

The first does not set the value of the variable if no row is found.
The second always sets the variable, including to null if no row is found.

The Constant Scan produces an empty row (with no columns!) that will result in the variable being updated in case nothing matches from the base table. The left join ensures the empty row survives the join. Variable assignment can be thought of as happening at the root node of the execution plan.
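The effect of the Constant Scan and the left join can be approximated by hand. The following is only a sketch of the semantics, not the plan the optimizer literally builds: the temp table `#Accounts` is a hypothetical stand-in for `dbo.Accounts`, and the derived table `C` plays the role of the Constant Scan.

```sql
-- Hypothetical stand-in for dbo.Accounts so the script is self-contained
CREATE TABLE #Accounts (AccountId uniqueidentifier NOT NULL PRIMARY KEY);
INSERT #Accounts (AccountId) VALUES ('FE2CA909-1162-4C6C-A7AC-33B257E28539');

DECLARE @result uniqueidentifier = 'FE2CA909-1162-4C6C-A7AC-33B257E28539';

-- C always produces exactly one row (the "Constant Scan").
-- The LEFT JOIN keeps that row even when the seek finds nothing,
-- so one row always reaches the assignment.
SELECT @result = A.AccountId
FROM (SELECT 1 AS Dummy) AS C
LEFT JOIN #Accounts AS A
    ON A.AccountId = '7AD4D33C-1ED7-4183-B7F3-48C33D666525';

SELECT @result AS Result; -- NULL: the assignment ran on a null-extended row

DROP TABLE #Accounts;
```

Because the join predicate matches no row, the single outer row survives with a null `AccountId`, which is what overwrites the variable.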

Using `SELECT @result`

-- Set initial value
DECLARE @result uniqueidentifier = {guid 'FE2CA909-1162-4C6C-A7AC-33B257E28539'};

-- @result does not change
SELECT @result = AccountId 
FROM Accounts 
WHERE AccountId={guid '7AD4D33C-1ED7-4183-B7F3-48C33D666525'};

SELECT @result;

Using `SET @result`

-- Set initial value
DECLARE @result uniqueidentifier = {guid 'FE2CA909-1162-4C6C-A7AC-33B257E28539'};

-- @result set to null
SET @result = 
(
    SELECT AccountId 
    FROM Accounts 
    WHERE AccountId={guid '7AD4D33C-1ED7-4183-B7F3-48C33D666525'}
);

SELECT @result;

Execution plans

No row arrives at the root node, so no assignment occurs.

A row always arrives at the root node, so variable assignment occurs.

The extra Constant Scan and Nested Loops Left Outer Join are nothing to be concerned about. The join in particular is cheap since it is guaranteed to encounter one row on its outer input, and at most one row (in your example) on the inner input.

There are other ways to guarantee that the subquery generates a row, so that the variable assignment always occurs. One is to use a redundant scalar aggregate (no GROUP BY clause):

-- Set initial value
DECLARE @result uniqueidentifier = {guid 'FE2CA909-1162-4C6C-A7AC-33B257E28539'};

-- @result set to null
SET @result = 
    (
        SELECT MAX(AccountId)
        FROM Accounts 
        WHERE AccountId={guid '7AD4D33C-1ED7-4183-B7F3-48C33D666525'} 
    );
SELECT @result;

Notice the scalar aggregate produces a row even though it receives no input.
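The difference between a scalar aggregate and a grouped (vector) aggregate over empty input can be seen with plain `SELECT` assignment as well. This is a minimal sketch; the temp table `#T` and the variable names are made up for illustration.

```sql
-- Hypothetical empty table, used only for this demonstration
CREATE TABLE #T (Id int NOT NULL PRIMARY KEY);

DECLARE @scalar int = 1, @vector int = 1;

-- Scalar aggregate: returns one row (NULL) even with no input,
-- so the assignment runs
SELECT @scalar = MAX(Id) FROM #T;

-- Vector aggregate: GROUP BY over no input returns no rows,
-- so the variable keeps its previous value
SELECT @vector = MAX(Id) FROM #T GROUP BY Id;

SELECT @scalar AS ScalarResult, @vector AS VectorResult; -- NULL, 1

DROP TABLE #T;
```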

Documentation:

If the SELECT statement returns no rows, the variable retains its present value. If expression is a scalar subquery that returns no value, the variable is set to NULL.

For assigning variables, we recommend that you use SET @local_variable instead of SELECT @local_variable.

Related Solutions

Sql-server – SQL Server Index Scan Actual Executions

Schema and indexes are only one aspect of query plan and performance. Your statement "but with different data" is likely the source of the difference. The number of rows and the distribution of data is essential to the query optimizer. If you have significantly more rows in D2, or if the data is of entirely different characteristics (wider or narrower range of values), then you should expect to see different performance and execution plans.

For each statistics object, SQL Server keeps a histogram with a maximum of 200 steps. As the tables grow and the distribution of values becomes more irregular, it becomes more likely that SQL Server will not have enough information to generate optimal execution plans. That's where filtered indexes and filtered statistics come into play.
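To see what the optimizer has to work with, the histogram for a statistics object can be inspected directly. The names below are placeholders borrowed from the question at the top of this page (`dbo.Accounts` and `AccountsIndex`); substitute your own table and index or statistics name.

```sql
-- Assumes dbo.Accounts and its index AccountsIndex exist (placeholder names)
DBCC SHOW_STATISTICS ('dbo.Accounts', AccountsIndex) WITH HISTOGRAM;
```

Comparing this output between the two databases is one way to check whether the data distributions really differ.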

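If this is a parameterized query, you may also be running into a parameter sniffing problem. Note that local variables change the estimates too: the optimizer cannot see their values at compile time, so it estimates from average density rather than from the histogram.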

Sql-server – SHOWPLAN does not display a warning but “Include Execution Plan” does for the same query

This:

SET SHOWPLAN_XML ON;
GO
SELECT * FROM sys.objects;
GO

is equivalent to pressing Display Estimated Execution Plan on the toolbar (or hitting Ctrl + L). You'll notice that the query returns no rows, unlike when you use Include Actual Execution Plan (Ctrl + M).

The spill warning is only a runtime warning. There is no way that SQL Server can know, when displaying the estimated plan, that a spill will happen at runtime. This is because a spill is caused by factors that might only be present during certain invocations of the query (for example, when there is memory pressure). The estimated plan knows roughly how much memory it's going to ask for, but it can't know until execution that it isn't going to get it.
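The same contrast can be reproduced entirely in T-SQL, since `SET STATISTICS XML ON` is the scripted counterpart of Include Actual Execution Plan. This is a sketch reusing the query from above; whether a spill warning actually appears depends on the query and on runtime conditions.

```sql
-- Estimated plan only: the statement is compiled but not executed
SET SHOWPLAN_XML ON;
GO
SELECT * FROM sys.objects;
GO
SET SHOWPLAN_XML OFF;
GO

-- Actual plan: the statement runs, rows come back, and the plan
-- carries runtime information (including any runtime warnings)
SET STATISTICS XML ON;
GO
SELECT * FROM sys.objects;
GO
SET STATISTICS XML OFF;
GO
```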

As an aside, may I recommend* our free tool, SQL Sentry Plan Explorer? I think it provides much more obvious information than Management Studio. I recently wrote a lengthy blog post that can act as a tutorial, and Jonathan Kehayias has a great PluralSight course on it as well.

* Disclaimer: I work for SQL Sentry.