Sql-server – Limit for HASH JOINs

join;sql serversql server 2014

I have a very complex query (who hasn't :-)). In this query is a JOIN to a table like this one (of course the original has some more columns):

CREATE TABLE dbo.tbl_detail (id bigint PRIMARY key, 
                   descr varchar(300), 
                   txt varchar(MAX),
                   descr_p varchar(300)
                  );

There are two joins to this table in the query:

  • LEFT JOIN dbo.tbl_detail td1 ON td1.id = main.detail_id
  • LEFT JOIN dbo.tbl_detail td2 ON td2.id = main.second_id
  • The query returns ~200k rows.
  • td1 outputs all columns
  • td2 outputs all columns except [txt] (varchar(max))

Problem:

  • td2 uses an HASH JOIN
  • but td1 uses an NESTED LOOKUP (over 200k rows; no statistic problem, since the estimates are correct)
  • when I remove the [txt] column from the td1 output it uses a HASH JOIN too
  • when I alter the column to VARCHAR(5000) (or smaller) it uses a HASH JOIN
  • when I alter the column to VARCHAR(8000) it uses the NESTED LOOKUP again
  • OPTION (HASH JOIN) or LEFT HASH JOIN would work, but produces an extrem slow query plan

Question:
Why does it not always use the HASH JOIN? Is there a length limit (per column or sum of all output columns)?

PS: Microsoft SQL Server 2014 (SP2) Developer Edition

Best Answer

It's important to remember that the query optimizer doesn't choose each join individually and independently from everything else going in on the query. Joins have different properties which means that a different join type may be better or worse depending on other joins or operations in the plan. I can generate test data that shows behavior similar to what you see here, but the reproduction depends on available memory for the server.

Put 10k rows each into two tables:

CREATE TABLE dbo.tbl_detail (
    id bigint PRIMARY key, 
    descr varchar(300), 
    descr_p varchar(300),
    txt_max varchar(MAX),
    txt_5k varchar(5000)
);

INSERT INTO dbo.tbl_detail WITH (TABLOCK)
SELECT TOP (10000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('DESCR', 60)
, REPLICATE('descr_p', 40)
, NULL
, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

CREATE TABLE dbo.OUTER_TABLE (
    id bigint primary key,
    id2 bigint
);

INSERT INTO dbo.OUTER_TABLE WITH (TABLOCK)
SELECT TOP (10000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

Here is the query to test:

SELECT o.*, d.id, d.descr, d.descr_p
FROM OUTER_TABLE o
LEFT OUTER JOIN tbl_detail d ON o.ID2 = d.ID
ORDER BY o.ID;

If I add d.txt_5k to the SELECT list I naturally get a hash join:

natural hash

However, if I add d.txt_max to the SELECT list I naturally get a loop join:

natural loop

The key here is that a loop join preserves the order of the outer table. That means that a nested loop join can avoid an explicit sort for some queries. Queries with a hash join may need to do the sort. The cost of the sort depends on (among other things) the estimated size of the data. The query optimizer estimates the size of the data based on the estimated number of rows and the data types. With the VARCHAR(5000) column I get an estimated size of 27 MB and with the VARCHAR(MAX) column I get an estimated size of 42 MB. On my machine the 42 MB sort is expected to spill to disk so that makes the hash join plan much more expensive than the loop plan for the VARCHAR(MAX) column.

We can see this more clearly by using hints to force the different plans so we can compare them. Here's the comparison for the VARCHAR(5000) query:

5k compare

And the comparison for the VARCHAR(MAX) query:

max compare

Of course, this may not be why you're seeing the behavior that you're seeing for your particular query. I just wanted to provide an example of why changing a data type could change the join type. I'm going to assume that there's a problem caused by the loop join and that you really need it to be hash join. The OPTION (HASH JOIN) hint may not be a good workaround because it will force every join to be a hash join in the query. The LEFT HASH JOIN hint may not be a good workaround because it implies a FORCE ORDER hint which means that the query optimizer will not be able to change the join order. Perhaps you can move the problematic part of the query to a temp table and apply whatever hints are needed to the smaller query. It's really hard to say much without more information.

You could also try a FORCESCAN hint to encourage the hash join, although I'm loathe to recommend it. Be sure to test carefully:

SELECT o.*, d.id, d.descr, d.descr_p, d.txt_max
FROM OUTER_TABLE o
LEFT OUTER JOIN tbl_detail d WITH (FORCESCAN) ON o.ID2 = d.ID 
ORDER BY o.ID
OPTION (MAXDOP 1);