Sql-server – Limit for HASH JOINs

join;sql serversql server 2014

I have a very complex query (who hasn't :-)). In this query is a JOIN to a table like this one (of course the original has some more columns):

CREATE TABLE dbo.tbl_detail (id bigint PRIMARY key, 
                   descr varchar(300), 
                   txt varchar(MAX),
                   descr_p varchar(300)
                  );

There are two joins to this table in the query:

LEFT JOIN dbo.tbl_detail td1 ON td1.id = main.detail_id
LEFT JOIN dbo.tbl_detail td2 ON td2.id = main.second_id
The query returns ~200k rows.
td1 outputs all columns
td2 outputs all columns except [txt] (varchar(max))

Problem:

td2 uses an HASH JOIN
but td1 uses an NESTED LOOKUP (over 200k rows; no statistic problem, since the estimates are correct)
when I remove the [txt] column from the td1 output it uses a HASH JOIN too
when I alter the column to VARCHAR(5000) (or smaller) it uses a HASH JOIN
when I alter the column to VARCHAR(8000) it uses the NESTED LOOKUP again
OPTION (HASH JOIN) or LEFT HASH JOIN would work, but produces an extrem slow query plan

Question:
Why does it not always use the HASH JOIN? Is there a length limit (per column or sum of all output columns)?

PS: Microsoft SQL Server 2014 (SP2) Developer Edition

Best Answer

It's important to remember that the query optimizer doesn't choose each join individually and independently from everything else going in on the query. Joins have different properties which means that a different join type may be better or worse depending on other joins or operations in the plan. I can generate test data that shows behavior similar to what you see here, but the reproduction depends on available memory for the server.

Put 10k rows each into two tables:

CREATE TABLE dbo.tbl_detail (
    id bigint PRIMARY key, 
    descr varchar(300), 
    descr_p varchar(300),
    txt_max varchar(MAX),
    txt_5k varchar(5000)
);

INSERT INTO dbo.tbl_detail WITH (TABLOCK)
SELECT TOP (10000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('DESCR', 60)
, REPLICATE('descr_p', 40)
, NULL
, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

CREATE TABLE dbo.OUTER_TABLE (
    id bigint primary key,
    id2 bigint
);

INSERT INTO dbo.OUTER_TABLE WITH (TABLOCK)
SELECT TOP (10000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

Here is the query to test:

SELECT o.*, d.id, d.descr, d.descr_p
FROM OUTER_TABLE o
LEFT OUTER JOIN tbl_detail d ON o.ID2 = d.ID
ORDER BY o.ID;

If I add d.txt_5k to the SELECT list I naturally get a hash join:

However, if I add d.txt_max to the SELECT list I naturally get a loop join:

The key here is that a loop join preserves the order of the outer table. That means that a nested loop join can avoid an explicit sort for some queries. Queries with a hash join may need to do the sort. The cost of the sort depends on (among other things) the estimated size of the data. The query optimizer estimates the size of the data based on the estimated number of rows and the data types. With the VARCHAR(5000) column I get an estimated size of 27 MB and with the VARCHAR(MAX) column I get an estimated size of 42 MB. On my machine the 42 MB sort is expected to spill to disk so that makes the hash join plan much more expensive than the loop plan for the VARCHAR(MAX) column.

We can see this more clearly by using hints to force the different plans so we can compare them. Here's the comparison for the VARCHAR(5000) query:

And the comparison for the VARCHAR(MAX) query:

Of course, this may not be why you're seeing the behavior that you're seeing for your particular query. I just wanted to provide an example of why changing a data type could change the join type. I'm going to assume that there's a problem caused by the loop join and that you really need it to be hash join. The OPTION (HASH JOIN) hint may not be a good workaround because it will force every join to be a hash join in the query. The LEFT HASH JOIN hint may not be a good workaround because it implies a FORCE ORDER hint which means that the query optimizer will not be able to change the join order. Perhaps you can move the problematic part of the query to a temp table and apply whatever hints are needed to the smaller query. It's really hard to say much without more information.

You could also try a FORCESCAN hint to encourage the hash join, although I'm loathe to recommend it. Be sure to test carefully:

SELECT o.*, d.id, d.descr, d.descr_p, d.txt_max
FROM OUTER_TABLE o
LEFT OUTER JOIN tbl_detail d WITH (FORCESCAN) ON o.ID2 = d.ID 
ORDER BY o.ID
OPTION (MAXDOP 1);

Related Solutions

Mysql – How to optimize a very slow query with joins and group by

OK, I ended up adding another lookup table:

CREATE TABLE IF NOT EXISTS `stops_routes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stop_id` varchar(100) NOT NULL,
  `route_id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
  KEY `stop_id` (`stop_id`),
  KEY `route_id` (`route_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

Filling it was fairly fast:

mysql> insert into stops_routes (stop_id, route_id)
    ->
    -> select
    ->     s.stop_id,
    ->     r.route_id as from_route_id
    ->
    -> from routes r
    -> left join trips t on t.route_id = r.route_id
    -> left join stop_times st on st.trip_id = t.trip_id
    -> left join stops s on s.stop_id = st.stop_id
    -> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496  Duplicates: 0  Warnings: 0

Using it is blazingly fast:

mysql> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id
    -> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0002        |
| 0001          | 0003        |
| 0001          | 0004        |
| 0001          | 0005        |
| 0001          | 0006        |
| 0001          | 0008        |
| 0001          | 0009        |
| 0001          | 0011        |
| 0001          | 0014        |
| 0001          | 0031        |
+---------------+-------------+
10 rows in set (0.63 sec)

Now I can fill my last lookup table (set of connections between every routes on my GTFS network):

mysql> insert into route_connections (from_route_id, to_route_id)
    -> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848  Duplicates: 0  Warnings: 0

Amazingly fast. I guess the engine couldn't break up the steps to optimize this.
I'd still be interested to know if it would be possible to get the same result (from route to route connections table) using only one sub-second or sub-minute query.

Sql-server – How many LEFT JOINS make a VIEW unusable

The link provided by Martin Smith to an existing SO question provided me with the answer:

The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process

Best Answer

Related Solutions

Mysql – How to optimize a very slow query with joins and group by

Sql-server – How many LEFT JOINS make a VIEW unusable

Related Question