Sql-server – Is left hash join always better than left outer join

hashingsql server

I have a query that runs quite slow (see below). While searching for a way to improve it we found that the query runs about ten times faster if we replaced the LEFT OUTER JOIN with LEFT HASH JOIN

The result seems to be the same. Is it? In what cases wont it return the same result? What difference are there? Is there something that i should look out for when running a LEFT HASH JOIN instead of a LEFT OUTER JOIN?

The [ABC].[ExternalTable] table in the query is a view on a different server that i have added as a external table

SELECT t.foo, t.bar, t.data
FROM [dbo].[Table] as t
LEFT OUTER JOIN [ABC].[ExternalTable] as s ON s.foo = t.foo and s.bar = t.bar and s.data = t.data
WHERE s.foo is null and s.bar IS NULL and s.data IS NULL

Best Answer

Semantically, both queries are the same. The LOOP versus the HASH simply tells SQL Server which option to use to return results. If you run the query without LOOP or HASH, SQL Server may pick either of those options, depending on which it thinks will perform best. Both options return identical results by design. To put it slightly differently¹, the keywords HASH and LOOP in this case are what is known as join hints, one of the three kinds of hints in Transact-SQL. To confuse you even more, you can also specify a join hint as a query hint, though the effect is not necessarily the same in both cases. Be aware that providing these kinds of join hints implies a FORCE ORDER hint, which specifies that the join order indicated by the query syntax is preserved during query optimization (see Paul's answer below for further details).

SQL Server uses statistics on each table in the query to make an informed choice about what kind of physical operation to take with for each JOIN in the T-SQL query statement.

In this case, since [ExternalTable] is a view referenced through a linked server, SQL Server probably expects there to be 1 row in the table - i.e. it has no idea how many rows to expect.

You can either add a join hint to your query to force a merge join, or simply copy rows from [ExternalTable] into a local #temp table with a clustered index, then run the query against that.

The full syntax for the hash join would be:

LEFT OUTER HASH JOIN [ABC].[ExternalTable] s ON s.foot = t.foo .....

The version with the HASH in the join statement is allowing SQL Server to choose the join type, and in your particular case it's most likely choosing a LOOP join, which you could force with:

LEFT OUTER LOOP JOIN [ABC].[ExternalTable] s ON s.foot = t.foo .....

I typically would not recommend specifying the join type since most of the time SQL Server is very capable of choosing the most appropriate style of join operator.

^{1 - thanks Andriy for the wording here.}

Related Solutions

Sql-server – Outer Apply vs Left Join Performance

Can anyone tell how exactly apply works and how will it effect the performance in very large data

APPLY is a correlated join (called a LATERAL JOIN in some products and newer versions of the SQL Standard). Like any logical construction, it has no direct impact on performance. In principle, we should be able to write a query using any logically equivalent syntax, and the optimizer would transform our input into exactly the same physical execution plan.

Of course, this would require the optimizer to know every possible transformation, and to have the time to consider each one. This process might well take longer than the current age of the universe, so most commercial products do not take this approach. Therefore, query syntax can, and often does, have an impact on final performance, though it is difficult to make general statements about which is better and why.

The specific form of OUTER APPLY ( SELECT TOP ... ) is most likely to result in a correlated nested loops join in current versions of SQL Server, because the optimizer does not contain logic to transform this pattern to an equivalent JOIN. Correlated nested loops join may not perform well if the outer input is large, and the inner input is unindexed, or the pages needed are not already in memory. In addition, specific elements of the optimizer's cost model mean a correlated nested loops join is less likely than a semantically-identical JOIN to produce a parallel execution plan.

I was able to make same query with single left join and row_number()

This may or may not be better in the general case. You will need to performance test both alternatives with representative data. The LEFT JOIN and ROW_NUMBER certainly has potential to be more efficient, but it depends on the precise query plan shape chosen. The primary factors that affect the efficiency of this approach is the availability of an index to cover the columns needed, and to supply the order needed by the PARTITION BY and ORDER BY clauses. A second factor is the size of the table. An efficient and well-indexed APPLY can out-perform a ROW_NUMBER with optimal indexing if the query touches a relatively small portion of the table concerned. Testing is needed.

Sql-server – LEFT OUTER JOIN versus WHERE IS NULL

The LEFT OUTER JOIN ON version is the standard way to write this logic, so I would highly recommend using this version rather than fiddling around with OR for performance reasons.

However, as ypercube has helpfully pointed out, the queries are not quite equivalent if the data and sharding methodology means that a Table1.Id value can end up in a different ShardKey than the ShardKey of the corresponding Table1.table2 value. If that happens, then the OR t2.ShardKey IS NULL portion of the 2nd query will be triggered and retain a row with matching Ids even if the ShardKey does not match.

In general, I suspect that they will typically result in similar query plans. Given two equivalent queries (which yours may or may not be, depending on the sharding methodology), by formulating the query in the standard way you are more likely to be working with the query optimizer rather than forcing it to recognize that the two queries can be expressed in similar ways and apply the appropriate simplifications.

If you want to be even more explicit about applying @shardkey to each table in the join, you can do that directly in the LEFT OUTER JOIN. However, SQL Server will be able to easily recognize that your original query is equivalent to this one.

SELECT t1.id, t2.ShardKey
FROM Table1 t1
LEFT OUTER JOIN Table2 t2 on t1.table2 = t2.id and t2.ShardKey = @shardkey
WHERE t1.id = @id and t1.ShardKey = @shardkey

An example of different query plans

On my SQL 2014 instance, the queries have (slightly) different query plans and the query plan for the LEFT OUTER JOIN ON version (top) is superior to the ORM-generated LEFT OUTER JOIN WHERE OR IS NULL version (bottom).

In particular, the seek into Table2 should be a seek on both id and ShardKey, but in the LEFT OUTER JOIN WHERE OR IS NULL version it is a seek only on id and a subsequent Filter operator tests all matching rows in order to apply the appropriate ShardKey restriction.

This difference may or may not be meaningful in your specific case (especially if Id is unique without adding in the ShardKey). However, if your real version actually uses table partitioning by ShardKey (and this just isn't represented in the simple sample query), you will likely be losing the benefit of partition elimination for the seek into Table2.

Best Answer

Related Solutions

Sql-server – Outer Apply vs Left Join Performance

Sql-server – LEFT OUTER JOIN versus WHERE IS NULL

Related Question