Sql-server – Specify build and probe tables to hash join

hashingjoin;sql server

I am observing a situation in AdventureWorks database where the hash join operator is building hash table using a larger input (296 rows) and then probing with a much smaller input (3 rows). I'm not able to understand why this reverse mapping is favoured by SQL server.

Following is the query and plan:

SELECT *
FROM HumanResources.Employee e
INNER MERGE JOIN HumanResources.EmployeeDepartmentHistory edh ON e.BusinessEntityID = edh.BusinessEntityID
INNER HASH JOIN HumanResources.Shift s ON edh.ShiftID = s.ShiftID

The plan is here: https://www.brentozar.com/pastetheplan/?id=Bk_YvD1iX

I wonder if I can specifically instruct SQL to use HumanResources.Shift for build and the other input to probe?

Best Answer

When you hint a query like that, you force the join order -- you've actually answered your own question.

If you want to test join types and still leave the optimizer with room to explore things, use query options, like this:

SELECT *
FROM HumanResources.Employee e
INNER JOIN HumanResources.EmployeeDepartmentHistory edh 
ON e.BusinessEntityID = edh.BusinessEntityID
INNER JOIN HumanResources.Shift s 
ON edh.ShiftID = s.ShiftID
OPTION(MERGE JOIN, HASH JOIN);

Related Solutions

Sql-server – Finding distinct rows across two tables: Full Outer Join more efficient than Union

The semantics of the two queries are not the same - UNION removes duplicates, whereas the FULL OUTER JOIN will not:

DECLARE @T1 AS table (id bigint NULL, val integer NULL);
DECLARE @T2 AS table (id bigint NULL, val integer NULL);

INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);

SELECT COALESCE(t1.id, t2.id) AS id, COALESCE(t1.val, t2.val) AS val
FROM @t1 t1
FULL OUTER JOIN @t2 t2
    ON t2.id = t1.id
    AND t2.val = t1.val;

SELECT t1.id, t1.val
FROM @t1 t1
UNION 
SELECT t2.id, t2.val
FROM @t2 t2;

Output:

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
╚════╩═════╝

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
╚════╩═════╝

That said, the optimizer does not know many FOJN tricks, so it is always possible that there is a better way to express the query than the natural UNION. Only commonly-useful and always-correct transformations are implemented.

Note that with a unique constraint only on the larger table, the optimizer chooses a hash union, without expensive duplicate-removal on the probe input, that makes it choose Concat Union All in the question example:

ALTER TABLE #t2 
ADD CONSTRAINT UQ2 
UNIQUE CLUSTERED (id);

SELECT COUNT(*), AVG(x.id), AVG(x.val)
FROM (
    SELECT t1.id, t1.val
    FROM #t1 t1
    UNION
    SELECT t2.id, t2.val
    FROM #t2 t2
) AS x;

The FOJN rewrite may well be a useful one in cases where you know there cannot be duplicates within each input set, but this condition is not enforced with a unique constraint or index (particularly on the large input).

If such a uniqueness guarantee does exist, and yet the optimizer does not select a Hash Union, you might try an OPTION (HASH UNION) hint, to see how it compares.

DB2 LUW: how to influence query planner’s choice of join

Hash joins require an "equijoin predicate". So I rewrote the query as an explicit join (instead of IN...(subselect)), and instead of using A.key = B.key as the join condition, I used A.key > B.key - 1 AND A.key < B.key +1.

Best Answer

Related Solutions

Sql-server – Finding distinct rows across two tables: Full Outer Join more efficient than Union

DB2 LUW: how to influence query planner’s choice of join

Related Question