Can anyone tell how exactly apply works and how will it effect the performance in very large data
APPLY
is a correlated join (called a LATERAL JOIN
in some products and newer versions of the SQL Standard). Like any logical construction, it has no direct impact on performance. In principle, we should be able to write a query using any logically equivalent syntax, and the optimizer would transform our input into exactly the same physical execution plan.
Of course, this would require the optimizer to know every possible transformation, and to have the time to consider each one. This process might well take longer than the current age of the universe, so most commercial products do not take this approach. Therefore, query syntax can, and often does, have an impact on final performance, though it is difficult to make general statements about which is better and why.
The specific form of OUTER APPLY ( SELECT TOP ... )
is most likely to result in a correlated nested loops join in current versions of SQL Server, because the optimizer does not contain logic to transform this pattern to an equivalent JOIN
. Correlated nested loops join may not perform well if the outer input is large, and the inner input is unindexed, or the pages needed are not already in memory. In addition, specific elements of the optimizer's cost model mean a correlated nested loops join is less likely than a semantically-identical JOIN
to produce a parallel execution plan.
I was able to make same query with single left join and row_number()
This may or may not be better in the general case. You will need to performance test both alternatives with representative data. The LEFT JOIN
and ROW_NUMBER
certainly has potential to be more efficient, but it depends on the precise query plan shape chosen. The primary factors that affect the efficiency of this approach is the availability of an index to cover the columns needed, and to supply the order needed by the PARTITION BY
and ORDER BY
clauses. A second factor is the size of the table. An efficient and well-indexed APPLY
can out-perform a ROW_NUMBER
with optimal indexing if the query touches a relatively small portion of the table concerned. Testing is needed.
The LEFT OUTER JOIN ON version is the standard way to write this logic, so I would highly recommend using this version rather than fiddling around with OR
for performance reasons.
However, as ypercube has helpfully pointed out, the queries are not quite equivalent if the data and sharding methodology means that a Table1.Id
value can end up in a different ShardKey
than the ShardKey
of the corresponding Table1.table2
value. If that happens, then the OR t2.ShardKey IS NULL
portion of the 2nd query will be triggered and retain a row with matching Id
s even if the ShardKey
does not match.
In general, I suspect that they will typically result in similar query plans. Given two equivalent queries (which yours may or may not be, depending on the sharding methodology), by formulating the query in the standard way you are more likely to be working with the query optimizer rather than forcing it to recognize that the two queries can be expressed in similar ways and apply the appropriate simplifications.
If you want to be even more explicit about applying @shardkey
to each table in the join, you can do that directly in the LEFT OUTER JOIN
. However, SQL Server will be able to easily recognize that your original query is equivalent to this one.
SELECT t1.id, t2.ShardKey
FROM Table1 t1
LEFT OUTER JOIN Table2 t2 on t1.table2 = t2.id and t2.ShardKey = @shardkey
WHERE t1.id = @id and t1.ShardKey = @shardkey
An example of different query plans
On my SQL 2014 instance, the queries have (slightly) different query plans and the query plan for the LEFT OUTER JOIN ON version (top) is superior to the ORM-generated LEFT OUTER JOIN WHERE OR IS NULL version (bottom).
In particular, the seek into Table2
should be a seek on both id
and ShardKey
, but in the LEFT OUTER JOIN WHERE OR IS NULL version it is a seek only on id
and a subsequent Filter
operator tests all matching rows in order to apply the appropriate ShardKey
restriction.
This difference may or may not be meaningful in your specific case (especially if Id
is unique without adding in the ShardKey
). However, if your real version actually uses table partitioning by ShardKey
(and this just isn't represented in the simple sample query), you will likely be losing the benefit of partition elimination for the seek into Table2
.
Best Answer
Semantically, both queries are the same. The
LOOP
versus theHASH
simply tells SQL Server which option to use to return results. If you run the query withoutLOOP
orHASH
, SQL Server may pick either of those options, depending on which it thinks will perform best. Both options return identical results by design. To put it slightly differently1, the keywordsHASH
andLOOP
in this case are what is known as join hints, one of the three kinds of hints in Transact-SQL. To confuse you even more, you can also specify a join hint as a query hint, though the effect is not necessarily the same in both cases. Be aware that providing these kinds of join hints implies aFORCE ORDER
hint, which specifies that the join order indicated by the query syntax is preserved during query optimization (see Paul's answer below for further details).SQL Server uses statistics on each table in the query to make an informed choice about what kind of physical operation to take with for each
JOIN
in the T-SQL query statement.In this case, since
[ExternalTable]
is a view referenced through a linked server, SQL Server probably expects there to be 1 row in the table - i.e. it has no idea how many rows to expect.You can either add a join hint to your query to force a merge join, or simply copy rows from
[ExternalTable]
into a local #temp table with a clustered index, then run the query against that.The full syntax for the hash join would be:
The version with the
HASH
in the join statement is allowing SQL Server to choose the join type, and in your particular case it's most likely choosing a LOOP join, which you could force with:I typically would not recommend specifying the join type since most of the time SQL Server is very capable of choosing the most appropriate style of join operator.
1 - thanks Andriy for the wording here.