Can anyone tell how exactly apply works and how will it effect the performance in very large data
APPLY
is a correlated join (called a LATERAL JOIN
in some products and newer versions of the SQL Standard). Like any logical construction, it has no direct impact on performance. In principle, we should be able to write a query using any logically equivalent syntax, and the optimizer would transform our input into exactly the same physical execution plan.
Of course, this would require the optimizer to know every possible transformation, and to have the time to consider each one. This process might well take longer than the current age of the universe, so most commercial products do not take this approach. Therefore, query syntax can, and often does, have an impact on final performance, though it is difficult to make general statements about which is better and why.
The specific form of OUTER APPLY ( SELECT TOP ... )
is most likely to result in a correlated nested loops join in current versions of SQL Server, because the optimizer does not contain logic to transform this pattern to an equivalent JOIN
. Correlated nested loops join may not perform well if the outer input is large, and the inner input is unindexed, or the pages needed are not already in memory. In addition, specific elements of the optimizer's cost model mean a correlated nested loops join is less likely than a semantically-identical JOIN
to produce a parallel execution plan.
I was able to make same query with single left join and row_number()
This may or may not be better in the general case. You will need to performance test both alternatives with representative data. The LEFT JOIN
and ROW_NUMBER
certainly has potential to be more efficient, but it depends on the precise query plan shape chosen. The primary factors that affect the efficiency of this approach is the availability of an index to cover the columns needed, and to supply the order needed by the PARTITION BY
and ORDER BY
clauses. A second factor is the size of the table. An efficient and well-indexed APPLY
can out-perform a ROW_NUMBER
with optimal indexing if the query touches a relatively small portion of the table concerned. Testing is needed.
(O.P.'s note: preferred solution is the 4th / final code block)
XML seems to me to be the obvious choice of data structure to use here.
with N as
(
select T.N
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
(12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select top(5 + abs(checksum(newid())) % 15)
N1.N as '@Value',
(
select top(1 + abs(checksum(newid())) % 10)
N2.N as '@Value',
(
select top(1 + abs(checksum(newid())) % 5)
N3.N as '@Value'
from N as N3
where N2.N > 0
for xml path('Level3'), type
)
from N as N2
where N1.N > 0
for xml path('Level2'), type
)
from N as N1
for xml path('Level1'), root('Root');
The trick to make SQL Server use different values for top()
for each node is to make the sub-queries correlated. N1.N > 0
and N2.N > 0
.
Flatteing the XML:
declare @X xml;
with N as
(
select T.N
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
(12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select @X = (
select top(5 + abs(checksum(newid())) % 15)
N1.N as '@Value',
(
select top(1 + abs(checksum(newid())) % 10)
N2.N as '@Value',
(
select top(1 + abs(checksum(newid())) % 5)
N3.N as '@Value'
from N as N3
where N2.N > 0
for xml path('Level3'), type
)
from N as N2
where N1.N > 0
for xml path('Level2'), type
)
from N as N1
for xml path('Level1')
);
select L1.X.value('@Value', 'varchar(10)')+'\'+
L2.X.value('@Value', 'varchar(10)')+'\'+
L3.X.value('@Value', 'varchar(10)')
from @X.nodes('/Level1') as L1(X)
cross apply L1.X.nodes('Level2') as L2(X)
cross apply L2.X.nodes('Level3') as L3(X);
And a version totally void of XML.
with N as
(
select T.N
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
(12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select cast(N1.N as varchar(10))+'\'+
cast(N2.N as varchar(10))+'\'+
cast(N3.N as varchar(10))
from (
select top(5 + abs(checksum(newid())) % 15)
N.N
from N
) as N1
cross apply
(
select top(1 + abs(checksum(newid())) % 10)
N.N
from N
where N1.N > 0
) as N2
cross apply
(
select top(1 + abs(checksum(newid())) % 5)
N.N
from N
where N2.N > 0
) as N3;
Correlation N1.N > 0
and N2.N > 0
is still important.
A version using a table with 20 names to be used instead of just integers.
declare @Elements table
(
Name nvarchar(50) not null
);
insert into @Elements(Name)
select top(20) C.name
from sys.columns as C
group by C.name;
select N1.Name + N'\' + N2.Name + N'\' + N3.Name
from (
select top(5 + abs(checksum(newid())) % 15)
E.Name
from @Elements as E
) as N1
cross apply
(
select top(1 + abs(checksum(newid())) % 10)
E.Name
from @Elements as E
where N1.Name > ''
) as N2
cross apply
(
select top(1 + abs(checksum(newid())) % 5)
E.Name
from @Elements as E
where N2.Name > ''
) as N3;
Best Answer
Try this query without the joins first:
This returns the rows where
man_id
isn't referencing anything. In other words, this returns the topmost-level managers only. For your particular example it works like this:the FROM clause returns the entire table:
the WHERE clause leaves just one row:
finally, the SELECT takes from it only one column:
Adding one self-join like in Chris's query,
gives you every top-level manager's direct subordinates. More specifically, the join results in the following row set:
The WHERE clause filters it down to just this:
And finally the SELECT clause returns only the names:
In the same manner, adding one more join, like this:
brings about the next level of subordinates:
this is the result of joining the third instance of the table to the results of the first join:
(I've shortened the table aliases for convenience:
1
stands forlevel1
,2
forlevel2
, and3
forlevel3
)this is what remains after the WHERE filter:
and this is what SELECT extracts from the above and gives you back:
Same continues for the rest of the joins. With your example, however, there will be no more rows, as the hierarchy in the table doesn't go deeper than two levels. Consequently, the other two columns in Chris's original query,
level4
andlevel5
, will return NULLs:If Chris's assumption that four joins should be enough for most cases does not work for your particular case (and I don't mean the simple example in your question, of course), you can throw in more self-joins using the same pattern.