SQL Server – Misplacing ON Condition in JOIN Without Syntax Error

sql serversql-server-2012

I have executed the following CTE with joins

;with cte(asd) as(select 601
union all
select 602)
,cte2(fgh) as (select 601 union all
select 602 union all
select 603 union all
select 604 union all
select 605 union all
select 606 union all
select 607 union all
select 608 union all
select 609 union all
select 610 union all
select 611 union all
select 612)
select p.* from cte2 p 
inner join cte c
inner join cte2 c2 on c.asd=c2.fgh
on p.fgh=c.asd

the result of the above query was as follows:

The execution plan for the query was as follows:

I was happy with the result, now I cross checked my query which I have used above, identified that there was a misplaced near on condition.

The misplaced join was as follows:

inner join cte c
inner join cte2 c2 on c.asd=c2.fgh
on p.fgh=c.asd

As per my understanding there must be a on condition after inner join, but sql server didn't throw an error but executed and results were as expected.

After I changed the on condition and verify the both results and execution plan both were same.

;with cte(asd) as(select 601
union all
select 602)
,cte2(fgh) as (select 601 union all
select 602 union all
select 603 union all
select 604 union all
select 605 union all
select 606 union all
select 607 union all
select 608 union all
select 609 union all
select 610 union all
select 611 union all
select 612)
select p.* from cte2 p 
inner join cte c
on p.fgh=c.asd
inner join cte2 c2 on c.asd=c2.fgh

Can any one please explain why this thing happens, how sql server was executing?

Best Answer

This is from Itzik Ben-Gan's book Inside Microsoft SQL Server 2008: T-SQL Querying

I find this explanation (picture including) very helpfull

Related Solutions

SQL Server Join Syntax and Style Performance Considerations

In a world where the query optimizer considered all possible join orders, and contained all possible logical transformations, the syntax we use for our queries would not matter at all.

As it is, the optimizer generally uses heuristics to pick an initial join order and explores a number of join order rewrites from there. It does this to avoid excessive compilation time and resource usage. It doesn't take all that many joins for the number of possible combinations to become unreasonable to explore exhaustively.

To take an extreme example, 42 joins are enough to generate more alternatives than there are atoms in the observable universe. More realistically, even 7 tables are enough to produce 665,280 alternatives. Although this is not a mind-boggling number, it would still take very significant time (and memory) to explore those alternatives completely.

Although the heuristics are largely based on the type of join (inner, outer, cross...) and cardinality estimates, the textual order of the query can also have an impact. Sometimes, this is an optimizer limitation - NOT EXISTS clauses are not reordered, and outer join reordering is very limited. Even with simple inner joins, the interaction between textual order, initial join order heuristics, and optimizer internals can be difficult to predict with certainty.

To take an example using the AdventureWorks sample database, I can write a query using the a common syntax form as:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.Product AS P
JOIN Production.ProductSubcategory AS PS
    ON PS.ProductSubcategoryID = P.ProductSubcategoryID
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = P.ProductID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

Before cost-based optimization, the logical query tree looks like this (note the join order is not the same as the written order):

Logical tree

I can (carefully) rewrite the query to use 'nested' syntax:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.ProductSubcategory AS PS
JOIN Production.Product AS P
JOIN Production.TransactionHistory AS TH
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = TH.ProductID
    ON TH.ProductID = P.ProductID
    ON P.ProductSubcategoryID = PS.ProductSubcategoryID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

In which case the logical tree at the same point is:

Input tree 2

The two different syntaxes produce a different initial join order in this case. After cost-based optimization, both produce the same output plan shape:

Plan shape

There are detailed differences between the two plans, with the 'nested' syntax producing a plan with a somewhat lower estimated cost:

Plan 2

The two inputs took a slightly different path through the optimizer, so it isn't all that surprising there are slight differences.

In general, using different syntax will sometimes (definitely not always!) produce different plan results. There is no broad correlation between one syntax and better plans. Most people write and maintain queries using something like the non-nested join syntax, so it often makes practical sense to use that.

To summarize, my advice is to write queries using whichever syntax seems most natural (and maintainable!) to you and your peers. If you get a better plan for a specific query using a particular syntax, by all means use it - but be sure to test that you still get the better plan whenever you patch or upgrade SQL Server :)

SQL Server – Syntax of INNER JOIN Nested Inside OUTER JOIN

If you look at the 2 execution plans, is there an easy answer to which is better? I purposefully did NOT create indexes so it's easier to see what's happening.

The second plan has a lower estimated cost, so in that limited sense it is 'better'.

The data sets are so small that the optimizer did not spend much time looking at alternatives. The first form of the query happens to find a plan using hash join and a table spool early on. The estimated cost of that plan is so low that the optimizer does not bother looking for anything better.

The second form of the query happens to find a plan using only nested loops outer joins early in the search process, and again the optimizer decides that plan is good enough. It so happens that this plan is estimated to be cheaper.

That said (as mentioned in the question comments) the two queries are not semantically identical. This may not be important to you if you can guarantee that the results will always be the same for all possible future states of your database, but the optimizer cannot make that assumption. It only ever produces plans that are guaranteed to produce the same results specified by the SQL, in all circumstances.

I have realized that the nested syntax also modifies the behaviour of the query.

The 'nested syntax' is just one aspect of the whole ANSI join syntax specification. To enable a full logical specification for more complex join patterns, the specification allows (optional) parentheses, and FROM clause subqueries.

The query can be written using the same ANSI syntax using parentheses:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN
(
    dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) ON M.ModelID = A.ModelID;

This form clearly shows that the logical requirement is to left join from Autos to the result of inner joining Manufacturers to Models. Omitting the optional parentheses gives the form you call 'nested':

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
    ON M.ModelID = A.ModelID;

This is not a different syntax - it is just omitting optional parentheses and reformatting a bit.

As Martin mentioned, it is also possible in this case to express the logical requirement using inner joins followed by a right outer join:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
RIGHT JOIN dbo.Autos AS A
    ON A.ModelID = M.ModelID;

All three query forms above use the same ANSI join syntax. All three also happen to produce the same physical execution plan with the data set provided:

Common execution plan

As I mentioned in my answer to your previous question, queries that express exactly the same logical requirement will not always produce the same execution plan. Which logical query form you prefer to use is largely a question of style. There is no correlation between one particular style and 'better' query plans in general. I would generally advise against rewriting a query to get a particular plan if the new query is not genuinely logically identical to the original.

The SQL standard also allows FROM clause subqueries, so yet another way to write the same query specification is:

SELECT * 
FROM dbo.Autos AS A
LEFT JOIN
(
    SELECT
        N.ManufacturerID,
        ManufacturerName = N.Name,
        M.ModelID,
        ModelName = M.Name
    FROM dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) AS R1
    ON R1.ModelID = A.ModelID;

Using the traditional syntax, we have to change the join to `Manufacturers to an outer join, like so... but this changes the query plan.

This probably changes the meaning of the query, in which case it is technically not a valid alternative (but see ypercube's comment on your question).

The (optional) parentheses in the ANSI join syntax are there precisely for more complex join requirements like this, so you should not be afraid to use them where necessary.

Best Answer

Related Solutions

SQL Server Join Syntax and Style Performance Considerations

SQL Server – Syntax of INNER JOIN Nested Inside OUTER JOIN

Related Question