Sql-server – Does the SQL Server generate the contents for a derived table contents for every matching record in a join

determinismsql serversql-server-2008-r2

I am on SQL Server 2008 R2. I am executing the following SQL a

use tempdb
go
create table t1 ( c1 int, c2 int, c3 uniqueidentifier null)

go
insert t1(c1,c2) values (1,2),(1,3)
go
update t1
set c2 = 5 + c2,
    c3 = t2.c3
from t1 join ( select c1=1, c3=newid())   as t2 on t1.c1 = t2.c1
go
select * from t1
go
drop table t1
go

The result of the query is

c1          c2          c3
----------- ----------- ------------------------------------
1           7           D6BE2119-CECB-4413-94E4-6099E5CC9028
1           8           0BF8A3E2-4E66-4091-A1B0-4FB8E130B347

I was expecting the GUID values in c3 to be the same. It appears that derived table t2 is being "reevaluated" for each row in the table t1 even though it is not a co-related sub-query. Is this expected behavior ?

Best Answer

You would expect the NEWID() to be calculated once, for the derived table t2. This is not the case, as it is calculated for each row of the join resultset of T1 and T2.

Simply put, for each row where T1.C1 = T2.C1 or in other words where T1.C1=1.

Here is a part of a statement from Microsoft regarding this:

Closing the loop . . . I've discussed this question with the Dev team. And eventually we have decided not to change current behavior, for the following reasons:

1) The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution....

As such the optimizer can choose the timing or number of executions, to get the best performance.

The source of the quote and some more information and examples and the response from microsoft

To explain it we best go through the query plan that is created for the UPDATE statement.

enter image description here

Right of the bat we don't see a join predicate, but simply a table scan. A predicate is added to that table scan, to only get the rows where C1 = 1

enter image description here

This is the 'join' reduced to a predicate.

After this, 2 rows are found that match this predicate, and are processed by the first Compute Scalar operator.

enter image description here

And NEWID() is calculated for each column of the joined table + derived table.

A workaround

The obvious fix is to use a variable to hold the NEWID() value as to calculate it before the join & reuse it.

create table t1 ( c1 int, c2 int, c3 uniqueidentifier null)

go
insert t1(c1,c2) values (1,2),(1,3)
go
DECLARE @test uniqueidentifier = newid()
update t1
set c2 = 5 + c2,
    c3 = t2.c3
from t1 join 
( select c1=1, c3=@test)   as t2 on t1.c1 = t2.c1
go
select * from t1
go
drop table t1
go

If you would have more than 1 source columns (T2) you would need to store the NEWID() values in a temporary table or table variable

Another interesting part:

So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code – so that results of JOINs, for example, follow the semantics of a nested-loop execution – then we can use appropriate OPTIONs to force that behavior – as UC points out. But the resulting code will run slow – that's the cost of, in effect, hobbling the Query Optimizer.

Source

By using functions we could force the optimizer to first calculate the NEWID() on the source table, before the join. But I am not able to do that on the query at hand (probably not the query's fault).