Sql-server – How does SQL recursion actually work

Tags: cte, recursive, sql server

Coming to SQL from other programming languages, the structure of a recursive query looks rather odd. Walk through it step by step, and it seems to fall apart.

Consider the following simple example:

CREATE TABLE #NUMS
(N BIGINT);

INSERT INTO #NUMS
VALUES (3), (5), (7);

WITH R AS
(
    SELECT N FROM #NUMS
    UNION ALL
    SELECT N*N AS N FROM R WHERE N*N < 10000000
)
SELECT N FROM R ORDER BY N;

Let's walk through it.

First, the anchor member executes and the result set is put into R. So R is initialized to {3, 5, 7}.

Then, execution drops below the UNION ALL and the recursive member is executed for the first time. It executes on R (that is, on the R that we currently have in hand: {3, 5, 7}). This results in {9, 25, 49}.

What does it do with this new result? Does it append {9, 25, 49} to the existing {3, 5, 7}, label the resulting union R, and then carry on with the recursion from there? Or does it redefine R to be only this new result {9, 25, 49} and do all the union-ing later?

Neither choice makes sense.

If R is now {3, 5, 7, 9, 25, 49} and we execute the next iteration of the recursion, then we will end up with {9, 25, 49, 81, 625, 2401} and we've lost {3, 5, 7}.

If R is now only {9, 25, 49}, then we have a mislabeling problem. R is understood to be the union of the anchor member result set and all the subsequent recursive member result sets. Whereas {9, 25, 49} is only a component of R. It is not the full R that we have accrued so far. Therefore, to write the recursive member as selecting from R makes no sense.

I certainly appreciate what @Max Vernon and @Michael S. have detailed below. Namely, that (1) all the components are created up to the recursion limit or null set, and then (2) all the components are unioned together. This is how I understand SQL recursion to actually work.

If we were redesigning SQL, maybe we would enforce a clearer and more explicit syntax, something like this:

WITH R AS
(
    SELECT   N
    INTO     R[0]
    FROM     #NUMS
    UNION ALL
    SELECT   N*N AS N
    INTO     R[K+1]
    FROM     R[K]
    WHERE    N*N < 10000000
)
SELECT N FROM R ORDER BY N;

Sort of like an inductive proof in mathematics.

The problem with SQL recursion as it currently stands is that it is written in a confusing way. The way it is written says that each component is formed by selecting from R, but it does not mean the full R that has been (or, appears to have been) constructed so far. It just means the previous component.
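One way to make the components visible within the existing syntax is to carry a level counter through the recursion. This is a minimal sketch, assuming the #NUMS table defined earlier; the Lvl column is a name introduced here for illustration and is not part of the original query.

```sql
-- Lvl is a hypothetical extra column recording which component
-- (anchor = 0, first recursive pass = 1, ...) produced each row.
WITH R AS
(
    SELECT N, 0 AS Lvl
    FROM #NUMS
    UNION ALL
    SELECT N*N AS N, Lvl + 1
    FROM R                      -- sees only rows from the previous level
    WHERE N*N < 10000000
)
SELECT Lvl, N
FROM R
ORDER BY Lvl, N;
```

With the three starting values, level 0 should hold {3, 5, 7}, level 1 {9, 25, 49}, and so on, with each level being the squares of the level before it.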

Best Answer

The Books Online (BOL) description of recursive CTEs gives the semantics of recursive execution as follows:

1. Split the CTE expression into anchor and recursive members.
2. Run the anchor member(s) creating the first invocation or base result set (T0).
3. Run the recursive member(s) with Ti as an input and Ti+1 as an output.
4. Repeat step 3 until an empty set is returned.
5. Return the result set. This is a UNION ALL of T0 to Tn.

So each level has as its input only the level above, not the entire result set accumulated so far.
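The logical steps can be imitated by hand with a loop. This is only a sketch of the semantics, not of what the engine does internally. It assumes the #NUMS table from the question; the work table #T and the variables @lvl and @rows are names made up for this illustration.

```sql
-- Hypothetical work table: every row remembers the level that produced it.
CREATE TABLE #T (Lvl INT NOT NULL, N BIGINT NOT NULL);

DECLARE @lvl INT = 0, @rows INT;

-- Run the anchor member to create the base result set (level 0).
INSERT INTO #T (Lvl, N)
SELECT 0, N FROM #NUMS;
SET @rows = @@ROWCOUNT;

-- Run the recursive member against the previous level only,
-- and repeat until a pass produces no rows.
WHILE @rows > 0
BEGIN
    INSERT INTO #T (Lvl, N)
    SELECT @lvl + 1, N * N
    FROM #T
    WHERE Lvl = @lvl            -- input is level @lvl, not everything so far
      AND N * N < 10000000;

    SET @rows = @@ROWCOUNT;     -- capture before any other statement resets it
    SET @lvl += 1;
END;

-- The final result is the UNION ALL of every level.
SELECT N FROM #T ORDER BY N;

DROP TABLE #T;
```

The filter on Lvl is what the recursive reference to the CTE name stands for: the rows produced by the previous pass. The loop should return the same twelve values as the recursive query.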

The above is how it works logically. Physically, recursive CTEs are currently always implemented with nested loops and a stack spool in SQL Server. This is described here and here, and it means that in practice each recursive element works with just the parent row from the previous level, not the whole level. But the various restrictions on allowable syntax in recursive CTEs mean that this approach works.

If you remove the ORDER BY from your query, the results are ordered as follows:

+---------+
|    N    |
+---------+
|       3 |
|       5 |
|       7 |
|      49 |
|    2401 |
| 5764801 |
|      25 |
|     625 |
|  390625 |
|       9 |
|      81 |
|    6561 |
+---------+

This is because the execution plan operates very similarly to the following C#:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public class Program
{
    private static readonly Stack<dynamic> StackSpool = new Stack<dynamic>();

    private static void Main(string[] args)
    {
        //temp table #NUMS
        var nums = new[] { 3, 5, 7 };

        //Anchor member
        foreach (var number in nums)
            AddToStackSpoolAndEmit(number, 0);

        //Recursive part
        ProcessStackSpool();

        Console.WriteLine("Finished");
        Console.ReadLine();
    }

    private static void AddToStackSpoolAndEmit(long number, int recursionLevel)
    {
        StackSpool.Push(new { N = number, RecursionLevel = recursionLevel });
        Console.WriteLine(number);
    }

    private static void ProcessStackSpool()
    {
        //recursion base case
        if (StackSpool.Count == 0)
            return;

        var row = StackSpool.Pop();

        int thisLevel = row.RecursionLevel + 1;
        long thisN = row.N * row.N;

        Debug.Assert(thisLevel <= 100, "max recursion level exceeded");

        if (thisN < 10000000)
            AddToStackSpoolAndEmit(thisN, thisLevel);

        ProcessStackSpool();
    }
}

NB1: As above, by the time the first child of anchor member 3 is being processed, all information about its siblings, 5 and 7, and their descendants has already been discarded from the spool and is no longer accessible.

NB2: The C# above has the same overall semantics as the execution plan, but the flow in the execution plan is not identical, as there the operators work in a pipelined execution fashion. This is a simplified example to demonstrate the gist of the approach. See the earlier links for more details on the plan itself.

NB3: The stack spool itself is apparently implemented as a non-unique clustered index with a key column of recursion level and uniqueifiers added as needed (source).

Related Solutions

Sql-server – queries against a linked server actually processed

I don't think this is empirically deterministic. My understanding is that an entire table could be brought over and processed on the calling server instead of being processed remotely on the linked server. If both servers are SQL Server 2008+ you should see scenarios like this:

SELECT x.foo FROM linked.db.dbo.x JOIN linked.db.dbo.y ON ...

And in these cases you will see that the join, filters etc. are processed on the remote server. But this should not be construed as a blanket statement! You may see slightly different remoting decisions in the plan for a slightly different scenario:

SELECT x.foo FROM localdb.dbo.x JOIN linked.db.dbo.y ON ...

These observations are for SQL Server 2008+ to SQL Server 2008+ linked servers. Other linked servers (lower versions of SQL Server, other platforms altogether) may behave differently, and the behavior may depend on the version, on the specific driver version, or on the query (e.g. whether a function is used). It could of course also depend on the permissions of the linked server account: if you can't see remote statistics / cardinality for the distributed portion of the query, the local optimizer is going to be pretty blind and may not make the decision you want.

If you want to guarantee that processing occurs on the remote server before it starts streaming data across, and you have actually tried to get it to work with four-part names and the plan does not try to process remotely, you can use OPENQUERY instead of explicitly referencing four-part names. Never tried that in a view, though. :-)
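As a rough sketch of the OPENQUERY form of the first query above: the names linked, db, x, y and foo come from the earlier examples, while the join columns x_id and id are hypothetical stand-ins, since the original join condition was not given.

```sql
-- The quoted query text is passed through to the linked server "linked"
-- and executed there; only its result rows come back to the caller.
SELECT q.foo
FROM OPENQUERY
(
    linked,
    'SELECT x.foo
     FROM db.dbo.x AS x
     JOIN db.dbo.y AS y
       ON y.x_id = x.id  /* hypothetical join columns */'
) AS q;
```

OPENQUERY takes the query as a literal string rather than a variable, so any filtering that should happen remotely has to be written into that string.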

If you come across a case where you expect processing to occur remotely and it's not, I'm more than happy to look at the actual plan and the stats profile to help understand why (or at the very least to pass it on to some of my connections who can help).

From my own personal experience, I've found much, much, much more reliable and predictable performance by doing whatever I can to make as many queries as possible entirely local. Linked servers are useful but they certainly don't lend themselves to real-time performance. While you may think it violates DRY in some sense, I would rather pay that cost once during a mirroring session or replication and have redundant data, than potentially pay for a sub-optimal remoting decision for every single cross-server query. And if the data can be slightly delayed there are even more options for moving data across to make it available locally (log shipping, SSIS, custom).

Sql-server – Parent-child Relationship with data scattered between these records

After some more thought and reading some of the posts, I broke up my query into two parts. I think that was the main confusion, because there were so many moving parts:

1. I created a view that performs the recursion on the target table (see below), thereby removing the need for the other columns.
2. I used this view in my 60-column query.

Here is the recursion that ended up working. Part of my problem was that I needed the parent and all children in the final result. I set OPTION (MAXRECURSION 676) because, from further research, that appeared to be the most that could be done.

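-- Column names are made consistent here on the assumption that CardNumber
-- is the key that ParentSerialNumber refers to.
WITH recursion AS (

    -- Anchor: every card starts as its own top-level parent
    SELECT 
        CardNumber
        ,CardNumber AS ParentSerialNumber
    FROM 
        dbo.Card

    UNION ALL

    -- Recursive: attach each child to the top-level parent of its parent row
    SELECT 
      p.CardNumber
      ,c.ParentSerialNumber
    FROM 
        recursion AS c
        INNER JOIN  dbo.Card AS p
            ON p.ParentSerialNumber = c.CardNumber
)

SELECT 
        CardNumber
        ,ParentSerialNumber
FROM
    recursion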
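The query above does not show where the MAXRECURSION hint goes. This is a minimal sketch using a made-up Counter CTE rather than dbo.Card, so that it runs on its own. The hint is attached to the outermost statement that consumes the CTE. As far as I know, the OPTION clause is not permitted inside a view definition, so with a view the hint would go on the query that selects from the view.

```sql
-- Counter is a hypothetical CTE that recurses 499 times, which exceeds
-- the default limit of 100 levels and so fails without the hint.
WITH Counter AS
(
    SELECT 1 AS i
    UNION ALL
    SELECT i + 1 FROM Counter WHERE i < 500
)
SELECT MAX(i) AS Deepest
FROM Counter
OPTION (MAXRECURSION 676);      -- raise the limit for this statement only
```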
