Sql-server – Inner join elimination inhibited by prior outer join

optimization, sql-server

Synopsis: Inner joins that can be logically eliminated are instead retained if there is a non-eliminated outer join earlier in the logical tree. Why?

Examples run in AdventureWorks2008R2 and later. I have added trace flags to give the overall context of successive trees and rules.

First example, for context:

The left join to Product is eliminated during simplification (no data is required from the joined table and the referenced values are unique).
The inner join to SalesOrderHeader is then eliminated during join collapse, aka Heuristic Join Reorder (no data is required from the joined table, the referrer is non-nullable, and it has an enforced FK).

SELECT sod.SalesOrderDetailID
FROM Sales.SalesOrderDetail AS sod
    LEFT JOIN Production.Product AS p -- Eliminated during simplification (Rule: RedundantLOJN)
        ON p.ProductID = sod.ProductID
    JOIN Sales.SalesOrderHeader AS soh -- Eliminated during join collapse. (Annotated by TF 8619)
        ON soh.SalesOrderID = sod.SalesOrderID
OPTION (RECOMPILE, QUERYTRACEON 8619, QUERYTRACEON 8621, QUERYTRACEON 8606, QUERYTRACEON 3604);
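The elimination conditions mentioned above (non-nullable referrer, enforced FK) can be checked from the catalog views. This is a minimal sketch, assuming the standard AdventureWorks schema; the column aliases are illustrative only. A foreign key that is disabled or not trusted would not be expected to support join elimination.

```sql
-- Inspect the FK from SalesOrderDetail to SalesOrderHeader and the
-- nullability of its referencing column(s).
SELECT
    fk.name AS ForeignKeyName,
    fk.is_disabled,        -- 1 = constraint is disabled
    fk.is_not_trusted,     -- 1 = existing rows were not verified
    c.name AS ReferencingColumn,
    c.is_nullable          -- 1 = referencing column allows NULL
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc
    ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns AS c
    ON c.object_id = fkc.parent_object_id
    AND c.column_id = fkc.parent_column_id
WHERE fk.parent_object_id = OBJECT_ID(N'Sales.SalesOrderDetail')
    AND fk.referenced_object_id = OBJECT_ID(N'Sales.SalesOrderHeader');
```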

In this second example however, the join to SalesOrderHeader could logically be eliminated, but isn't.

The left join is retained because data is required from Product. In the logical trees this join is defined as being prior to the join that does not eliminate.
The subsequent join to SalesOrderHeader could logically be eliminated, because the prior join cannot invalidate the elimination requirement: not null referrer + FK integrity.

SELECT p.Name
FROM Sales.SalesOrderDetail AS sod
    LEFT JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID
    JOIN Sales.SalesOrderHeader AS soh -- Logically eligible for elimination.
        ON soh.SalesOrderID = sod.SalesOrderID
OPTION (RECOMPILE, QUERYTRACEON 8619, QUERYTRACEON 8621, QUERYTRACEON 8606, QUERYTRACEON 3604);

Finally, three variants where the join is successfully eliminated.

In the query text, placing the outer join after the problem join changes the logical tree. The logical meaning is unchanged, but the inner join no longer has the outer join as a descendant in the logical tree.

NOTE: This is a rare example in SQL Server where the written order of the joins in the query affects the query plan.

SELECT p.Name
FROM Sales.SalesOrderDetail AS sod
    JOIN Sales.SalesOrderHeader AS soh -- Eliminated during join collapse. (Annotated by TF 8619)
        ON soh.SalesOrderID = sod.SalesOrderID
    LEFT JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID
OPTION (RECOMPILE, QUERYTRACEON 8619, QUERYTRACEON 8621, QUERYTRACEON 8606, QUERYTRACEON 3604);

If the first join is changed to inner, the second join is successfully eliminated.

SELECT p.Name
FROM Sales.SalesOrderDetail AS sod
    JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID
    JOIN Sales.SalesOrderHeader AS soh -- Eliminated during join collapse. (Annotated by TF 8619)
        ON soh.SalesOrderID = sod.SalesOrderID
OPTION (RECOMPILE, QUERYTRACEON 8619, QUERYTRACEON 8621, QUERYTRACEON 8606, QUERYTRACEON 3604);

Also, as a solution, we can instead change the second join to outer:

SELECT p.Name
FROM Sales.SalesOrderDetail AS sod
    LEFT JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID
    LEFT JOIN Sales.SalesOrderHeader AS soh -- Eliminated during simplification (Rule: RedundantLOJN)
        ON soh.SalesOrderID = sod.SalesOrderID
OPTION (RECOMPILE, QUERYTRACEON 8621, QUERYTRACEON 8606, QUERYTRACEON 3604);

Conclusion

The above examples appear to demonstrate that an outer join may prevent a subsequent inner join elimination, despite it being logically possible.

My speculation is that properties that facilitate the inner join elimination (non null referrer, FK integrity) are not propagated up to the properties of the output of the outer join operator.

Can anyone confirm what the actual cause is?

The take away here is that if you create multi-purpose views that leverage join elimination for optimal plans, you need to be aware of this interaction, and potentially amend joins to avoid unnecessary work during execution.
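As a sketch of that advice, a multi-purpose view could list its inner joins before its outer joins, mirroring the third example above. The view name dbo.SalesDetailWide and its column list are hypothetical, and whether the join is actually removed should be confirmed in the execution plan.

```sql
-- Hypothetical view: inner join written first, outer join last.
CREATE VIEW dbo.SalesDetailWide
AS
SELECT
    sod.SalesOrderDetailID,
    soh.OrderDate,
    p.Name AS ProductName
FROM Sales.SalesOrderDetail AS sod
    JOIN Sales.SalesOrderHeader AS soh
        ON soh.SalesOrderID = sod.SalesOrderID
    LEFT JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID;
GO

-- Only Product data is requested, so the join to SalesOrderHeader
-- is a candidate for elimination (check the plan to verify).
SELECT v.ProductName
FROM dbo.SalesDetailWide AS v;
```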

Best Answer

Many of the simplifications performed before cost-based optimization are targeted at generated queries (ORMs and the like). These queries often follow a pattern and result in logically redundant projections, selections, and joins.

There is a trade-off to be made here. Any number of rewrites and simplifications are logically possible. Each of these will need to be assessed against the current tree, and applied if the local circumstances are suitable. All this takes time and resources. Rules run before cost-based optimization are considered for every query, even ones with very little unoptimized cost, or which will qualify later for a trivial plan.

For those reasons, the optimizer team were careful to include here only rules with a relatively low cost (implementation and runtime), and high applicability.

Consider: Some rules are more difficult to implement than others. Some are more costly to evaluate than is justified by the potential gains. Some would introduce subtle bugs elsewhere in the optimizer code due to internal dependencies. Others are simply not common enough to make implementing them worthwhile. Still others would be easy to implement, would be commonly-enough useful, but weren't thought of at the time, and haven't been requested (loudly enough) since. For example, join elimination with multi-column relationships.
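To make the multi-column case concrete, here is a minimal sketch using two hypothetical tables (dbo.ParentDemo and dbo.ChildDemo are not part of AdventureWorks). The join is logically redundant, yet per the point above it would not be expected to be eliminated.

```sql
-- Hypothetical schema with a two-column foreign key.
CREATE TABLE dbo.ParentDemo
(
    KeyA int NOT NULL,
    KeyB int NOT NULL,
    CONSTRAINT PK_ParentDemo PRIMARY KEY (KeyA, KeyB)
);

CREATE TABLE dbo.ChildDemo
(
    ChildID int NOT NULL PRIMARY KEY,
    KeyA int NOT NULL,
    KeyB int NOT NULL,
    CONSTRAINT FK_ChildDemo_ParentDemo
        FOREIGN KEY (KeyA, KeyB)
        REFERENCES dbo.ParentDemo (KeyA, KeyB)
);

-- No columns are needed from ParentDemo, the referencing columns are
-- NOT NULL, and the FK is enforced, so the join is logically redundant.
SELECT C.ChildID
FROM dbo.ChildDemo AS C
JOIN dbo.ParentDemo AS P
    ON P.KeyA = C.KeyA
    AND P.KeyB = C.KeyB;
```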

An example relevant to your question, using the same schema:

-- Join eliminated
SELECT SOD.ProductID 
FROM Sales.SalesOrderDetail AS SOD
LEFT JOIN Production.Product AS P
    ON P.ProductID = SOD.ProductID;

-- Join not eliminated when projecting from the null-extended (non-preserved) side of the join
SELECT P.ProductID 
FROM Sales.SalesOrderDetail AS SOD
LEFT JOIN Production.Product AS P
    ON P.ProductID = SOD.ProductID;

The join is not eliminated there, though we might argue P.ProductID and SOD.ProductID are guaranteed identical in all respects by the logic and schema. More to the current point, the outer join in the second query is not converted to an inner join, which would allow the simplification targeted by the question.

Again, this is not because the SQL Server optimizer developers were stupid or lazy. This sort of thing just isn't common enough to be worthwhile checking for on every compilation.

In general, to get the best out of join simplification and elimination, you should construct written joins in a logical order (e.g. joined tables adjacent) and ensure the four conditions noted by Rob Farley are met.

Reordering joins

It is possible, but often complex and expensive, to move outer joins around other joins in some limited contexts. These transformations are tricky, so the vast majority of this type of effort is limited to the search 2 (full optimization) stage of cost-based optimization. Even so, relatively few of the logical possibilities here have been researched and/or implemented in SQL Server.

It is all too easy to change semantics unintentionally during transforms of this kind. For some introductory discussion see Be Careful When Mixing INNER and OUTER Joins by Jeff Smith. For more technical details, there are a wide range of technical papers, for example Outerjoin Simplification and Reordering for Query Optimization by César A. Galindo-Legaria (Microsoft) and Arnon Rosenthal.
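A small self-contained illustration of how regrouping inner and outer joins can change results. The table variables and their data are invented for this sketch and are not taken from either cited source.

```sql
DECLARE @Autos TABLE (AutoID int NOT NULL, ModelID int NULL);
DECLARE @Models TABLE (ModelID int NOT NULL, ManufacturerID int NOT NULL);
DECLARE @Manufacturers TABLE (ManufacturerID int NOT NULL);

INSERT @Autos VALUES (1, 10), (2, NULL);  -- auto 2 has no model
INSERT @Models VALUES (10, 100);
INSERT @Manufacturers VALUES (100);

-- Form 1: outer join to the result of the inner join.
-- Both autos are returned; auto 2 is NULL-extended.
SELECT A.AutoID, M.ModelID, N.ManufacturerID
FROM @Autos AS A
LEFT JOIN
(
    @Manufacturers AS N
    JOIN @Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) ON M.ModelID = A.ModelID;

-- Form 2: outer join first, then inner join.
-- The inner join rejects the NULL-extended row, so only auto 1 is returned.
SELECT A.AutoID, M.ModelID, N.ManufacturerID
FROM @Autos AS A
LEFT JOIN @Models AS M
    ON M.ModelID = A.ModelID
JOIN @Manufacturers AS N
    ON N.ManufacturerID = M.ManufacturerID;
```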

Heuristic join reorder does make some efforts to reorganize cross joins, inner joins, and outer joins, but these efforts are very much at the lightweight end of the spectrum for all the reasons previously noted.

I'll leave you with this fun rewrite that does allow elimination:

SELECT P.[Name]
FROM Production.Product AS P
RIGHT JOIN Sales.SalesOrderDetail AS SOD
JOIN Sales.SalesOrderHeader AS SOH
    ON SOH.SalesOrderID = SOD.SalesOrderID
    ON SOD.ProductID = P.ProductID;
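The nesting in that rewrite is easier to see with the optional parentheses written out. This is the same query as I read it, with the inner join grouped explicitly; I have not separately verified that the optimizer treats the two spellings identically.

```sql
-- Same logical query with the optional parentheses made explicit:
-- the inner join is evaluated as a unit, then right-joined to Product.
SELECT P.[Name]
FROM Production.Product AS P
RIGHT JOIN
(
    Sales.SalesOrderDetail AS SOD
    JOIN Sales.SalesOrderHeader AS SOH
        ON SOH.SalesOrderID = SOD.SalesOrderID
)
    ON SOD.ProductID = P.ProductID;
```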

db<>fiddle demo

As Lennart mentioned:

You may find some interest in the following articles: https://dzone.com/articles/cool-sql-optimizations-that-do-not-depend-on-the-c and https://dzone.com/articles/cool-sql-optimizations-that-do-not-depend-on-the-c-1 It compares a number of DBMS (sql-server-2014 among others) for "algebraic" optimizations that do not rely on the cost-model.

Those are mostly accurate for SQL Server, with the exception of 4. Removing “Silly” Predicates, which doesn't reflect that SQL Server differentiates between EQ (equal, null-rejecting) and IS (null-aware) comparisons. To be clear, SQL Server does support this.
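A minimal illustration of the EQ versus IS distinction, using inline data assumed purely for this sketch. INTERSECT is used here as one well-known way to express a null-aware comparison in T-SQL.

```sql
-- EQ (null-rejecting): NULL = NULL evaluates to UNKNOWN,
-- so the NULL row is filtered out and one row is returned.
SELECT v.x
FROM (VALUES (1), (NULL)) AS v (x)
WHERE v.x = v.x;

-- IS (null-aware): INTERSECT treats NULLs as equal,
-- so both rows are returned.
SELECT v.x
FROM (VALUES (1), (NULL)) AS v (x)
WHERE EXISTS (SELECT v.x INTERSECT SELECT v.x);
```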

Related Solutions

Sql-server – Syntax of INNER JOIN nested inside OUTER JOIN vs. query results

If you look at the 2 execution plans, is there an easy answer to which is better? I purposefully did NOT create indexes so it's easier to see what's happening.

The second plan has a lower estimated cost, so in that limited sense it is 'better'.

The data sets are so small that the optimizer did not spend much time looking at alternatives. The first form of the query happens to find a plan using hash join and a table spool early on. The estimated cost of that plan is so low that the optimizer does not bother looking for anything better.

The second form of the query happens to find a plan using only nested loops outer joins early in the search process, and again the optimizer decides that plan is good enough. It so happens that this plan is estimated to be cheaper.

That said (as mentioned in the question comments) the two queries are not semantically identical. This may not be important to you if you can guarantee that the results will always be the same for all possible future states of your database, but the optimizer cannot make that assumption. It only ever produces plans that are guaranteed to produce the same results specified by the SQL, in all circumstances.

I have realized that the nested syntax also modifies the behaviour of the query.

The 'nested syntax' is just one aspect of the whole ANSI join syntax specification. To enable a full logical specification for more complex join patterns, the specification allows (optional) parentheses, and FROM clause subqueries.

The query can be written using the same ANSI syntax using parentheses:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN
(
    dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) ON M.ModelID = A.ModelID;

This form clearly shows that the logical requirement is to left join from Autos to the result of inner joining Manufacturers to Models. Omitting the optional parentheses gives the form you call 'nested':

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
    ON M.ModelID = A.ModelID;

This is not a different syntax - it is just omitting optional parentheses and reformatting a bit.

As Martin mentioned, it is also possible in this case to express the logical requirement using inner joins followed by a right outer join:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
RIGHT JOIN dbo.Autos AS A
    ON A.ModelID = M.ModelID;

All three query forms above use the same ANSI join syntax. All three also happen to produce the same physical execution plan with the data set provided:

[Image: common execution plan]

As I mentioned in my answer to your previous question, queries that express exactly the same logical requirement will not always produce the same execution plan. Which logical query form you prefer to use is largely a question of style. There is no correlation between one particular style and 'better' query plans in general. I would generally advise against rewriting a query to get a particular plan if the new query is not genuinely logically identical to the original.

The SQL standard also allows FROM clause subqueries, so yet another way to write the same query specification is:

SELECT * 
FROM dbo.Autos AS A
LEFT JOIN
(
    SELECT
        N.ManufacturerID,
        ManufacturerName = N.Name,
        M.ModelID,
        ModelName = M.Name
    FROM dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) AS R1
    ON R1.ModelID = A.ModelID;

Using the traditional syntax, we have to change the join to Manufacturers to an outer join, like so... but this changes the query plan.

This probably changes the meaning of the query, in which case it is technically not a valid alternative (but see ypercube's comment on your question).

The (optional) parentheses in the ANSI join syntax are there precisely for more complex join requirements like this, so you should not be afraid to use them where necessary.

MySQL – Efficiency with inner join queries

SELECT record_id from `table_a`
where customer_id="654"
and record_id in
    (SELECT cat_id from `table_b` where cat_id="654");

The meaning of a JOIN is the AND of the meanings of its arguments. ON and WHERE both AND in a condition. You want rows where (using obvious aliases):
customer [a.customer_id] ...
AND customer [b.cat_id] ...
AND [a.customer_id] = 654 AND [b.cat_id] = 654 AND [a.record_id] = [b.cat_id]

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
WHERE a.customer_id = 654 AND b.cat_id = 654
AND a.record_id = b.cat_id

(In standard SQL, (INNER) JOIN needs an ON. So you could use CROSS JOIN or replace WHERE by ON.)

As a comment says, MySQL has historically not been very good at optimizing. But it is constantly improving. IN has been notoriously slow, even when it is equivalent to other more optimized expressions. You may get better performance by explicitly equating the ids first in an ON:

SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
ON a.record_id = b.cat_id
AND a.customer_id = 654 AND b.cat_id = 654

Declare each of those columns as PRIMARY KEY or UNIQUE NOT NULL if it is unique (either implicitly adds an index); otherwise add an index on it. In MySQL, an unadorned KEY is a synonym for INDEX when not in a column declaration, and it does not tell the database that a column set is unique. Yes, uniqueness affects performance, so give your tables their proper rows, and declare and enforce any uniqueness with PRIMARY KEY or UNIQUE.

If you make the ids INT then the DBMS only needs to go to the index, not the data.

(Also, read the documentation re keys, indices and optimization. Use EXPLAIN.)
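A sketch of that advice in MySQL syntax. The index names are hypothetical, and the UNIQUE declaration assumes record_id really is unique in table_a; drop it or change it to a plain index if that does not hold.

```sql
-- Assumption: record_id uniquely identifies rows in table_a.
ALTER TABLE `table_a` ADD UNIQUE KEY `uq_table_a_record_id` (`record_id`);

-- Plain (non-unique) indexes on the filtered columns.
ALTER TABLE `table_a` ADD INDEX `ix_table_a_customer_id` (`customer_id`);
ALTER TABLE `table_b` ADD INDEX `ix_table_b_cat_id` (`cat_id`);

-- Ask MySQL how it intends to execute the join.
EXPLAIN
SELECT a.record_id
FROM `table_a` a JOIN `table_b` b
ON a.record_id = b.cat_id
AND a.customer_id = 654 AND b.cat_id = 654;
```

Comparing the EXPLAIN output before and after adding the indexes shows whether the optimizer actually uses them.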
