SQL Server – Using EXCEPT in a Recursive Common Table Expression

exceptexecution-planrecursivesql serversql-server-2008

Why does the following query return infinite rows? I would have expected the EXCEPT clause to terminate the recursion..

with cte as (
    select *
    from (
        values(1),(2),(3),(4),(5)
    ) v (a)
)
,r as (
    select a
    from cte
    where a in (1,2,3)
    union all
    select a
    from (
        select a
        from cte
        except
        select a
        from r
    ) x
)
select a
from r

I came across this while trying to answer a question on Stack Overflow.

Best Answer

See Martin Smith's answer for information about the current status of EXCEPT in a recursive CTE.

To explain what you were seeing, and why:

I'm using a table variable here, to make the distinction between the anchor values and recursive item clearer (it does not change the semantic).

DECLARE @V TABLE (a INTEGER NOT NULL)
INSERT  @V (a) VALUES (1),(2)
;
WITH rCTE AS 
(
    -- Anchor
    SELECT
        v.a
    FROM @V AS v

    UNION ALL

    -- Recursive
    SELECT
        x.a
    FROM
    (
        SELECT
            v2.a
        FROM @V AS v2

        EXCEPT

        SELECT
            r.a
        FROM rCTE AS r
    ) AS x
)
SELECT
    r2.a
FROM rCTE AS r2
OPTION (MAXRECURSION 0)

The query plan is:

Recursive CTE Plan

Execution starts at the root of the plan (the SELECT) and control passes down the tree to the Index Spool, Concatenation, and then to the top-level Table Scan.

The first row from the scan passes up the tree and is (a) stored in the Stack Spool, and (b) returned to the client. Which row is first is not defined, but let us assume it is the row with the value {1}, for the sake of argument. The first row to appear is therefore {1}.

Control again passes down to the Table Scan (the Concatenation operator consumes all rows from its outermost input before opening the next one). The scan emits the second row (value {2}), and this again passes up the tree to be stored on the stack and output to the client. The client has now received the sequence {1}, {2}.

Adopting a convention where the top of the LIFO stack is on the left, the stack now contains {2, 1}. As control again passes to the Table Scan, it reports no more rows, and control passes back to the Concatenation operator, which opens it's second input (it needs a row to pass up to the stack spool), and control passes to the Inner Join for the first time.

The Inner join calls the Table Spool on its outer input, which reads the top row from the stack {2} and deletes it from the worktable. The stack now contains {1}.

Having received a row on its outer input, the Inner Join passes control down its inner input to the Left Anti-Semi Join (LASJ). This requests a row from its outer input, passing control to the Sort. Sort is a blocking iterator, so it reads all rows from the table variable and sorts them ascending (as it happens).

The first row emitted by the Sort is therefore the value {1}. The inner side of the LASJ returns the current value of the recursive member (the value just popped off the stack), which is {2}. The values at the LASJ are {1} and {2} so {1} is emitted, since the values do not match.

This row {1} flows up the query plan tree to the Index (Stack) Spool where it is added to the stack, which now contains {1, 1}, and emitted to the client. The client has now received the sequence {1}, {2}, {1}.

Control now passes back to the Concatenation, back down the inner side (it returned a row last time, might do again), down through the Inner Join, to the LASJ. It reads its inner input again, obtaining the value {2} from the Sort.

The recursive member is still {2}, so this time the LASJ finds {2} and {2}, resulting in no row being emitted. Finding no more rows on its inner input (the Sort is now out of rows), control passes back up to the Inner Join.

The Inner Join reads its outer input, which results in the value {1} being popped off the stack {1, 1}, leaving the stack with just {1}. The process now repeats, with the value {2} from a new invocation of the Table Scan and Sort passing the LASJ test and being added to the stack, and passes to the client, which has now received {1}, {2}, {1}, {2}...and round we go.

My favourite explanation of the Stack spool used in recursive CTE plans is Craig Freedman's.

Related Solutions

Sql-server – using CASE with ISNULL

CASE is an expression that returns a single value. You are attempting to use it as control of flow logic to optionally include a filter, and it doesn't work that way. Maybe something like this would work:

AND COALESCE(Project.Contribution, 0) >= CASE @BudgetFilter 
  WHEN 0 THEN 25000 ELSE 2000000000 END
-- guessing that 2 billion is enough to always be greater
-- than the highest valid contribution

Expanding to your three conditions, you need to use multiple expressions because you can't use CASE, again, to change >= to <. So something like:

AND COALESCE(Project.Contribution, 0) >= CASE @BudgetFilter
  WHEN 0 THEN 25000 ELSE 2000000000 END
AND COALESCE(Project.Contribution, 0) < CASE
  @BudgetFilter WHEN 1 THEN 25000 WHEN 2 THEN 0 ELSE -2000000000 END)

Or you can use the OR conditionals as mentioned in the comments:

AND ((@BudgetFilter = 0 AND Project.Contribution >= 25000)
  OR (@BudgetFilter IN (1,2) AND Project.Contribution < CASE
    @BudgetFilter WHEN 1 THEN 25000 ELSE 0 END))

Or you can just use dynamic SQL, which may be useful if you have a lot of other criteria and/or you are having difficulty getting consistent behavior from the plans for the above variations:

DECLARE @sql NVARCHAR(MAX) = N'SELECT ...
   WHERE ...
   AND COALESCE(Project.Contribution, 0) '
+ CASE @BudgetFilter
  WHEN 0 THEN ' >= 25000'
  WHEN 1 THEN ' < 25000'
  WHEN 2 THEN ' < 0' END + ';';

EXEC sp_executesql @sql;

(And in case you're wondering why I use COALESCE instead of ISNULL. That said, the COALESCE/ISNULL is really meaningless here, because the COALESCEd value (0) will only potentially meet the < 25000 criteria - and I'm not sure that's what you intended at all.)

Sql-server – Optimizing removal of duplicate rows from recursive CTE query

How about this? Kind of a hack - but row_number should be quicker than distinct.

;WITH Lineage AS
(
SELECT
    Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY ParentName ORDER BY ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,ParentName AS ParentNameToDisplay 
FROM
    #Project
WHERE
    Name in ('C1','C2')
UNION ALL
SELECT
    p.Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY p.ParentName ORDER BY p.ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,p.ParentName 
FROM
    #Project as p
INNER JOIN
    Lineage ON p.Name = Lineage.ParentNameToJoin
)
SELECT
    Name
   ,ParentNameToDisplay
FROM
    lineage

Best Answer

Related Solutions

Sql-server – using CASE with ISNULL

Sql-server – Optimizing removal of duplicate rows from recursive CTE query

Related Question