Sql-server – join syntax / style performance consideration

sql serversql-server-2012

We recently found on one of our stored procedures that we get significant performance improvement by changing the join syntax/style of our query from this…

SELECT b.bla, c.foo, d.bar
FROM dbo.TableB b
JOIN dbo.TableC c
    JOIN dbo.TableD d  -- <-- Nested join syntax
    ON d.yyy = c.yyy
ON c.xxx = b.xxx

To this…

SELECT b.bla, c.foo, d.bar
FROM dbo.TableB b
JOIN dbo.TableC c
ON c.xxx = b.xxx
JOIN dbo.TableD d   -- <-- Regular way
ON d.yyy = c.yyy

Note: in the real query, there were 10 joined tables including inner and outer joins. The tables were not huge in terms of sql data. No aggregates. There was a DISTINCT on the output. All joins were to a primary key, but the foreign key was not necessarily indexed.

We will certainly change our ways, but I'm still curious as to the proper 'guidance' on such style. I've often used the 'indented' style to indicate a 'more readable' join for such things as lookup tables.

Best Answer

In a world where the query optimizer considered all possible join orders, and contained all possible logical transformations, the syntax we use for our queries would not matter at all.

As it is, the optimizer generally uses heuristics to pick an initial join order and explores a number of join order rewrites from there. It does this to avoid excessive compilation time and resource usage. It doesn't take all that many joins for the number of possible combinations to become unreasonable to explore exhaustively.

To take an extreme example, 42 joins are enough to generate more alternatives than there are atoms in the observable universe. More realistically, even 7 tables are enough to produce 665,280 alternatives. Although this is not a mind-boggling number, it would still take very significant time (and memory) to explore those alternatives completely.

Although the heuristics are largely based on the type of join (inner, outer, cross...) and cardinality estimates, the textual order of the query can also have an impact. Sometimes, this is an optimizer limitation - NOT EXISTS clauses are not reordered, and outer join reordering is very limited. Even with simple inner joins, the interaction between textual order, initial join order heuristics, and optimizer internals can be difficult to predict with certainty.

To take an example using the AdventureWorks sample database, I can write a query using the a common syntax form as:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.Product AS P
JOIN Production.ProductSubcategory AS PS
    ON PS.ProductSubcategoryID = P.ProductSubcategoryID
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = P.ProductID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

Before cost-based optimization, the logical query tree looks like this (note the join order is not the same as the written order):

Logical tree

I can (carefully) rewrite the query to use 'nested' syntax:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.ProductSubcategory AS PS
JOIN Production.Product AS P
JOIN Production.TransactionHistory AS TH
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = TH.ProductID
    ON TH.ProductID = P.ProductID
    ON P.ProductSubcategoryID = PS.ProductSubcategoryID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

In which case the logical tree at the same point is:

Input tree 2

The two different syntaxes produce a different initial join order in this case. After cost-based optimization, both produce the same output plan shape:

Plan shape

There are detailed differences between the two plans, with the 'nested' syntax producing a plan with a somewhat lower estimated cost:

Plan 2

The two inputs took a slightly different path through the optimizer, so it isn't all that surprising there are slight differences.

In general, using different syntax will sometimes (definitely not always!) produce different plan results. There is no broad correlation between one syntax and better plans. Most people write and maintain queries using something like the non-nested join syntax, so it often makes practical sense to use that.

To summarize, my advice is to write queries using whichever syntax seems most natural (and maintainable!) to you and your peers. If you get a better plan for a specific query using a particular syntax, by all means use it - but be sure to test that you still get the better plan whenever you patch or upgrade SQL Server :)