SQL Execution Plan – Merge Join Turns to Hash Match with Second Join

execution-plan, join

I would like to educate myself on what goes on under the hood in SQL Server, so I'm about to delve into Grant Fritchey's book Dissecting SQL Server Execution Plans. As it's 181 pages, I wanted to ask the simple question that got me interested in this in the first place. Hopefully an answer will pique my interest further (plus I'm too impatient to wait until I've waded through this tome to get some kind of answer!).

I am using this miniature version of Northwind to run this simple query:

SELECT Orders.OrderID
  ,Orders.OrderDate
  ,[Order Details].UnitPrice
FROM Orders
JOIN [Order Details]
  ON Orders.OrderID = [Order Details].OrderID

This gives an execution plan that uses a Merge Join.

That seems sensible, given that both tables are indexed and sorted on the joining fields. But if I add another table (Products) like so:

SELECT Orders.OrderID
  ,Orders.OrderDate
  ,[Order Details].UnitPrice
  ,Products.ProductName
FROM Orders
JOIN [Order Details]
  ON Orders.OrderID = [Order Details].OrderID
JOIN Products
  ON Products.ProductID = [Order Details].ProductID

Suddenly I have Hash Matches. This seemed odd to me (remember I'm just beginning to learn this stuff!), as I thought hash matches were for large, unsorted inputs. Why would SQL Server consider one join type fine in the first query but not in the second, even though both queries join the same number of rows, using the same indexes, between Orders and Order Details?

Orders has 830 rows, Order Details 2155 and Products 77.

OrderID has a clustered index on both Orders and Order Details, and ProductID has a non-clustered index on Order Details and a clustered index on Products.

Thanks

Best Answer

I think this Stack Exchange link gives a good rundown of why it is doing what it is doing. In your case, I believe it has most to do with the size of the result sets and how they are indexed and sorted. Orders and Order Details in this database have 830 and 2155 rows respectively. These are result sets of similar size, with roughly 2.6 rows in Order Details for every row in Orders. Order Details is clustered on (OrderID, ProductID), meaning the table is also sorted that way. Orders is clustered on OrderID, so it is sorted on OrderID as well. I believe this makes it easy to merge the two data sets, since both already arrive sorted on the join column.
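To make the idea of merging two pre-sorted inputs concrete, here is a minimal Python sketch of a merge join. It is an illustration of the general technique, not SQL Server's actual implementation, and the sample rows and the function name `merge_join` are made up for the example.

```python
def merge_join(left, right, left_key, right_key):
    """Join two lists that are ALREADY sorted on their join keys."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1          # this left row has no (more) matches
        elif lk > rk:
            j += 1          # this right row has no match on the left
        else:
            # Emit every right row that shares this key,
            # then move on to the next left row.
            k = j
            while k < len(right) and right_key(right[k]) == lk:
                result.append((left[i], right[k]))
                k += 1
            i += 1
    return result


# Made-up sample rows, not real Northwind data.
# orders is sorted by OrderID; order_details by (OrderID, ProductID).
orders = [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")]
order_details = [(1, 11, 14.0), (1, 42, 9.8), (2, 14, 18.6),
                 (3, 41, 7.7), (3, 51, 42.4)]

joined = merge_join(orders, order_details, lambda o: o[0], lambda d: d[0])
for order, detail in joined:
    print(order[0], order[1], detail[2])
```

Both inputs are walked front to back with no sorting and no extra lookup structure, which is why this kind of join is cheap when the inputs are already ordered on the join column.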

Then you throw in Products. Run this query:

SELECT * FROM dbo.[Order Details] AS od
ORDER BY od.OrderID, od.ProductID

This shows you the logical order of the Order Details table. Looked at from a Products perspective, that is, by ProductID alone, the order looks basically random. So now you are adding a join to a table that has only 77 rows and whose sort order does not match.
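A small Python sketch of that observation, using made-up (OrderID, ProductID) pairs rather than real Northwind rows: data sorted on the pair is sorted on OrderID, but not on ProductID alone.

```python
def is_sorted(values):
    """True if the sequence is in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Made-up (OrderID, ProductID) pairs in clustered-index order.
order_details = [(1, 11), (1, 42), (1, 72), (2, 14), (2, 51),
                 (3, 41), (3, 51), (3, 65)]

order_ids = [od[0] for od in order_details]
product_ids = [od[1] for od in order_details]

print(is_sorted(order_ids))    # True: ready for a merge on OrderID
print(is_sorted(product_ids))  # False: a merge on ProductID would need a sort first
```

A merge join on ProductID would therefore need this input in a different order first, and that extra work is part of what the optimizer weighs against simply hashing the small Products table.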

I believe that is why you see the Merge Join for the first query and the Hash Match for the second.

Read that link above as it gives a nice description of these things and how they work. Another good one is at this other SE article.

There is a lot of information in those two and in many other places (including your book).

Related Solutions

SQL Server – Execution Plan Basics: Hash Match Confusion

As SQLRockstar's answer quotes, a hash join is

"best for large, unsorted inputs."

Now:

From the Users.DisplayName index scan (assumed nonclustered), the rows come back in DisplayName order, so Users.Id (assumed clustered) is unsorted.
You are also scanning Posts for OwnerUserId, which is unsorted as well.

That gives two unordered inputs.

I'd consider an index on the Posts table on OwnerUserId, including Title. This will add some order on one side of the input to the JOIN, and it will be a covering index:

CREATE INDEX IX_OwnerUserId ON Posts (OwnerUserId) INCLUDE (Title)

You may then find that the Users.DisplayName index won't be used and it will scan the PK instead.

Improving Query Performance by Removing Hash Match Inner Join in SQL Server 2014

The following links are a good source of knowledge about execution plans.

From Execution Plan Basics — Hash Match Confusion I found:

From http://sqlinthewild.co.za/index.php/2007/12/30/execution-plan-operations-joins/

"The hash join is one of the more expensive join operations, as it requires the creation of a hash table to do the join. That said, it’s the join that’s best for large, unsorted inputs. It is the most memory-intensive of any of the joins.

The hash join first reads one of the inputs and hashes the join column and puts the resulting hash and the column values into a hash table built up in memory. Then it reads all the rows in the second input, hashes those and checks the rows in the resulting hash bucket for the joining rows."

which links to this post:

http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx
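The build and probe phases described in the quote above can be sketched in Python as follows. This is a simplified, in-memory illustration of the general technique, not SQL Server's implementation; the function name `hash_join` and the sample rows are made up.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner join using a hash table built from one input."""
    # Build phase: read one input and bucket its rows by join key.
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)

    # Probe phase: read the other input once and look up each key.
    result = []
    for row in probe_rows:
        for match in table.get(probe_key(row), []):
            result.append((match, row))
    return result


# Made-up sample rows. Neither input needs to be sorted on ProductID.
products = [(11, "Product A"), (14, "Product B"), (42, "Product C")]
order_details = [(1, 11), (1, 42), (2, 14), (3, 11), (3, 99)]

joined = hash_join(products, order_details,
                   lambda p: p[0], lambda d: d[1])
for product, detail in joined:
    print(detail[0], product[1])
```

The sketch builds on the smaller input (here the products list), since the hash table has to be held in memory. The row with ProductID 99 finds no bucket and is dropped, as expected for an inner join.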

Can you explain this execution plan? provides good insights about execution plans. It is not specific to hash match, but it is relevant:

The constant scans are a way for SQL Server to create a bucket into which it's going to place something later in the execution plan. I've posted a more thorough explanation of it here. To understand what the constant scan is for, you have to look further into the plan. In this case, it's the Compute Scalar operators that are being used to populate the space created by the constant scan.

The Compute Scalar operators are being loaded up with NULL and the value 1045876, so they're clearly going to be used with the Loop Join in an effort to filter the data.

The really cool part is that this plan is Trivial. It means that it went through a minimal optimization process. All the operations are leading up to the Merge Interval. This is used to create a minimal set of comparison operators for an index seek (details on that here).

From the question Can I get SSMS to show me the Actual query costs in the Execution plan pane?:

I'm fixing performance issues on a multistatement stored procedure in SQL Server. I want to know which part(s) I should spend time on.

I understand from How do I read Query Cost, and is it always a percentage? that even when SSMS is told to Include Actual Execution Plan, the "Query cost (relative to the batch)" figures are still based on cost estimates, which can be far off actuals.

Measuring Query Performance : “Execution Plan Query Cost” vs “Time Taken” gives good info for when you need to compare the performance of 2 different queries.

In Reading a SQL Server Execution plan you can find great tips for reading the execution plan.

Other questions and answers that I really liked because they are relevant to this subject, and that I would like to list for my own reference, are:

How to optimise T-SQL query using Execution Plan

can sql generate a good plan for this procedure?

Execution Plans Differ for the Same SQL Statement
