SQL Server 2014 – Optimize Self-Join on Primary Key

Tags: optimization, sql-server, sql-server-2014

Consider this query which consists of N self-joins:

select
    t1.*
from [Table] as t1
join [Table] as t2 on
    t1.Id = t2.Id
-- ...
join [Table] as tN on
    t1.Id = tN.Id

It produces an execution plan with N clustered index scans and N-1 merge joins.

I don't see any reason not to optimize away all of the joins and do just one clustered index scan, i.e. to rewrite the original query as:

select
    t1.*
from [Table] as t1

Questions

Why aren't the joins optimized away?
Is it mathematically incorrect to say that none of these joins changes the result set?

Tested on:

Source Server Version : SQL Server 2014 (12.0.4213)
Source Database Engine Edition : Microsoft SQL Server Standard Edition
Source Database Engine Type : Standalone SQL Server
Compatibility level : SQL Server 2008 (100)

The query isn't meaningful; it just came to my mind and I'm curious about it now.

Here's the fiddle with table creation and 3 queries: with inner joins, with left joins, and mixed. You can look at the execution plans there, too.

It seems that left joins are eliminated in the resulting execution plan while inner joins are not. I still don't get why, though.

Best Answer

First, let's assume that (id) is the primary key of the table. In this case, yes, the joins are provably redundant and could be eliminated.

Now that's just theory - or mathematics. In order for the optimizer to do an actual elimination, the theory has to have been converted into code and added in the optimizer's suite of optimizations/rewritings/eliminations. For that to happen, the (DBMS) developers must think that it will have good benefits to efficiency and that it's a common enough case.

Personally, this doesn't strike me as a common enough case. The query, as you admit, looks rather silly, and a reviewer shouldn't let it pass review until the redundant joins have been removed.

That said, there are similar queries where the elimination does happen. There is a very nice related blog post by Rob Farley: JOIN simplification in SQL Server.

In our case, all we have to do is change the joins to LEFT joins. See dbfiddle.uk. The optimizer in this case knows that the join can be safely removed without possibly changing the results. (The simplification logic is quite general and is not special-cased for self-joins.)

In the original query, of course, removing the INNER joins cannot change the results either. But self-joining on the primary key is not at all common, so the optimizer does not have this case implemented. It is common, however, to join (or left join) where the joined column is the primary key of one of the tables (and there is often a foreign key constraint). This leads to a second option for eliminating the joins: add a (self-referencing!) foreign key constraint:

ALTER TABLE "Table"
    ADD FOREIGN KEY (id) REFERENCES "Table" (id) ;

And voilà, the joins are eliminated! (Tested in the same fiddle, reproduced here.)

create table docs
(id int identity primary key,
 doc varchar(64)
) ;
GO

✓

insert
into docs (doc)
values ('Enter one batch per field, don''t use ''GO''')
     , ('Fields grow as you type')
     , ('Use the [+] buttons to add more')
     , ('See examples below for advanced usage')
  ;
GO

4 rows affected

--------------------------------------------------------------------------------
-- Or use XML to see the visual representation, thanks to Justin Pealing and
-- his library: https://github.com/JustinPealing/html-query-plan
--------------------------------------------------------------------------------
set statistics xml on;
select d1.* from docs d1 
    join docs d2 on d2.id=d1.id
    join docs d3 on d3.id=d1.id
    join docs d4 on d4.id=d1.id;
set statistics xml off;
GO

id | doc                                      
-: | :----------------------------------------
 1 | Enter one batch per field, don't use 'GO'
 2 | Fields grow as you type                  
 3 | Use the [+] buttons to add more          
 4 | See examples below for advanced usage

--------------------------------------------------------------------------------
-- Or use XML to see the visual representation, thanks to Justin Pealing and
-- his library: https://github.com/JustinPealing/html-query-plan
--------------------------------------------------------------------------------
set statistics xml on;
select d1.* from docs d1 
    left join docs d2 on d2.id=d1.id
    left join docs d3 on d3.id=d1.id
    left join docs d4 on d4.id=d1.id;
set statistics xml off;
GO

id | doc                                      
-: | :----------------------------------------
 1 | Enter one batch per field, don't use 'GO'
 2 | Fields grow as you type                  
 3 | Use the [+] buttons to add more          
 4 | See examples below for advanced usage

alter table docs
  add foreign key (id) references docs (id) ;
GO

✓

--------------------------------------------------------------------------------
-- Or use XML to see the visual representation, thanks to Justin Pealing and
-- his library: https://github.com/JustinPealing/html-query-plan
--------------------------------------------------------------------------------
set statistics xml on;
select d1.* from docs d1 
    join docs d2 on d2.id=d1.id
    join docs d3 on d3.id=d1.id
    join docs d4 on d4.id=d1.id;
set statistics xml off;
GO

id | doc                                      
-: | :----------------------------------------
 1 | Enter one batch per field, don't use 'GO'
 2 | Fields grow as you type                  
 3 | Use the [+] buttons to add more          
 4 | See examples below for advanced usage
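
The foreign key only helps if SQL Server trusts it: a constraint that was disabled, or re-enabled without validating the existing rows, is not used for join elimination. A minimal sketch for checking this, assuming the docs table from the fiddle lives in the dbo schema (the fiddle does not state the schema):

```sql
-- Assumption: the fiddle's docs table is dbo.docs.
-- is_not_trusted = 0 and is_disabled = 0 mean the optimizer may rely on the constraint.
SELECT fk.name,
       fk.is_disabled,
       fk.is_not_trusted
FROM sys.foreign_keys AS fk
WHERE fk.parent_object_id = OBJECT_ID(N'dbo.docs');

-- If the constraint shows up as not trusted, re-validating the existing rows
-- marks it as trusted again:
-- ALTER TABLE dbo.docs WITH CHECK CHECK CONSTRAINT ALL;
```

A constraint created with a plain ADD FOREIGN KEY, as above, is checked against existing rows by default, so it should normally start out trusted.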

Query Plan Analysis

The query you have now is:

UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
    ON P.tempId = H.tempId
    AND P.n = H.n;

This generates the rather inefficient plan:

[Image: Default plan]

The main problems in this plan are the hash join and sort. Both require a memory grant (the hash join needs to build a hash table, and the sort needs room to store the rows while sorting progresses). Plan Explorer shows this query was granted 765 MB:

[Image: Memory Grant]

This is quite a lot of server memory to dedicate to one query! More to the point, this memory grant is fixed before execution starts based on row count and size estimates.

If the memory turns out to be insufficient at execution time, at least some data for the hash and/or sort will be written to physical tempdb disk. This is known as a 'spill' and it can be a very slow operation. You can trace these spills (in SQL Server 2008) using the Profiler events Hash Warnings and Sort Warnings.
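
On SQL Server 2012 and later, the same warnings are also exposed as Extended Events, which can be lighter-weight than a Profiler trace. The following is a sketch only: the session name spill_warnings is made up for illustration, and these events are not available on the SQL Server 2008 instance discussed here.

```sql
-- Hypothetical session name; requires SQL Server 2012 or later.
CREATE EVENT SESSION spill_warnings ON SERVER
ADD EVENT sqlserver.hash_warning
    (ACTION (sqlserver.session_id, sqlserver.sql_text)),
ADD EVENT sqlserver.sort_warning
    (ACTION (sqlserver.session_id, sqlserver.sql_text))
ADD TARGET package0.ring_buffer;
GO

ALTER EVENT SESSION spill_warnings ON SERVER STATE = START;
GO

-- ... run the UPDATE under test from another window ...

ALTER EVENT SESSION spill_warnings ON SERVER STATE = STOP;
DROP EVENT SESSION spill_warnings ON SERVER;
GO
```

While the session is running, the captured events can be inspected from the session's node in Management Studio (Watch Live Data or the ring_buffer target).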

The estimate for the hash table's build input is very good:

[Image: Hash Input]

The estimate for the sort input is less accurate:

[Image: Sort Input]

You would have to use Profiler to check, but I suspect the sort will spill to tempdb in this case. It is also possible that the hash table spills too, but that is less clear-cut.

Note that the memory reserved for this query is split between the hash table and sort, because they run concurrently. The Memory Fractions plan property shows the relative amount of the memory grant expected to be used by each operation.
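
If Plan Explorer is not to hand, the grant can also be observed while the statement is running by querying a dynamic management view from a second connection. This is a rough sketch, not part of the original analysis; it reports the grant for the whole query rather than the per-operator fractions:

```sql
-- Run from a different session while the UPDATE is executing.
SELECT mg.session_id,
       mg.requested_memory_kb,   -- what the query asked for
       mg.granted_memory_kb,     -- what it was given
       mg.used_memory_kb,        -- what it is using right now
       mg.max_used_memory_kb     -- peak usage so far
FROM sys.dm_exec_query_memory_grants AS mg
WHERE mg.session_id <> @@SPID;   -- exclude this monitoring query
```

A max_used_memory_kb far below granted_memory_kb suggests the grant was over-estimated. A grant that is fully used is consistent with a spill but does not prove one.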

Why Sort and Hash?

The sort is introduced by the query optimizer to ensure that rows arrive at the Clustered Index Update operator in clustered key order. This promotes sequential access to the table, which is often much more efficient than random access.

The hash join is a less obvious choice, because its inputs are similar sizes (to a first approximation, anyway). Hash join is best where one input (the one that builds the hash table) is relatively small.

In this case, the optimizer's costing model determines that hash join is the cheapest of the three options (hash, merge, nested loops).

Improving Performance

The cost model does not always get it right. It tends to over-estimate the cost of parallel merge join, especially as the number of threads increases. We can force a merge join with a query hint:

UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
    ON P.tempId = H.tempId
    AND P.n = H.n
OPTION (MERGE JOIN);

This produces a plan that does not require as much memory (because merge join does not need a hash table):

[Image: Merge Plan]

The problematic sort is still there, because merge join only preserves the order of its join keys (tempId, n) but the clustered keys are (tempId, n, sporder). You may find the merge join plan performs no better than the hash join plan.

Nested Loops Join

We can also try a nested loops join:

UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
    ON P.tempId = H.tempId
    AND P.n = H.n
OPTION (LOOP JOIN);

The plan for this query is:

[Image: Serial Nested Loops Plan]

This query plan is considered the worst by the optimizer's costing model, but it does have some very desirable features. First, nested loops join does not require a memory grant. Second, it can preserve the key order from the Persons table so that an explicit sort is not needed. You may find this plan performs relatively well, perhaps even good enough.

Parallel Nested Loops

The big drawback with the nested loops plan is that it runs on a single thread. It is likely this query benefits from parallelism, but the optimizer decides there is no advantage in doing that here. This is not necessarily correct either. Unfortunately, there is no built-in query hint to get a parallel plan, but there is an undocumented way:

UPDATE t1
  SET t1.HHID = t2.HHID
  FROM dbo.persons AS t1
  INNER JOIN dbo.households AS t2
  ON t1.tempId = t2.tempId AND t1.n = t2.n
OPTION (LOOP JOIN, QUERYTRACEON 8649);

Enabling trace flag 8649 with the QUERYTRACEON hint produces this plan:

[Image: Parallel Nested Loops Plan]

Now we have a plan that avoids the sort, requires no extra memory for the join, and uses parallelism effectively. You should find this query performs much better than the alternatives.

There is more information on parallelism in my article Forcing a Parallel Query Execution Plan.
