Sql-server – Dissecting Joins

sql servert-sql

Recently I asked a question in dba.stackexchange the question was about data hierarchy in sql and a guy named chris allen answered the question and I really loved it but when I try to understand that query much more I got confused.

Here's the query:

Res_id  Res_name         Man_id
1           sam             3
2           Biju            4
3           adrian          Null
4           Helen           3
5           Micah           4


select  
 level1.res_name as level1,
 level2.res_name as level2,
 level3.res_name as level3,
 level4.res_name as level4,
 level5.res_name as level5

from resource as level1
left join resource as level2 on level1.res_id=level2.man_id
left join resource as level3 on level2.res_id=level3.man_id
left join resource as level4 on level3.res_id=level4.man_id
left join resource as level5 on level4.res_id=level5.man_id

where isnull(level1.man_id,0)=0

I did several approaches to crack and was in vain. I think I am missing something well my findings are given below

First he named four columns as each level 1,2,3 etc
Each time he left joined the same table 4 time each with different aliases
By giving condition on each join the table is getting changed to next level
Where condition to specify the manager

The part I failed to understand is the join he made even though I know the use of left join where we will get the columns of left table and on the right if no match is found it will be NULL.So I am looking for some visualization of the join happening here or some understanding of the joins in depth of this particular scenario.If the question doesn't fit here I will delete it . I also asked the same in the question I posted but there was no reply that is why I am putting a new question

Best Answer

Try this query without the joins first:

select  
 level1.res_name as level1

from resource as level1

where isnull(level1.man_id,0)=0

This returns the rows where man_id isn't referencing anything. In other words, this returns the topmost-level managers only. For your particular example it works like this:

the FROM clause returns the entire table:

level1.Res_id  level1.Res_name  level1.Man_id
-------------  ---------------  -------------
1              sam              3
2              Biju             4
3              adrian           NULL
4              Helen            3
5              Micah            4

the WHERE clause leaves just one row:

level1.Res_id  level1.Res_name  level1.Man_id
-------------  ---------------  -------------
3              adrian           NULL

finally, the SELECT takes from it only one column:
```
level1
------
adrian
```

Adding one self-join like in Chris's query,

select  
 level1.res_name as level1,
 level2.res_name as level2

from resource as level1
left join resource as level2 on level1.res_id=level2.man_id

where isnull(level1.man_id,0)=0

gives you every top-level manager's direct subordinates. More specifically, the join results in the following row set:

level1.Res_id  level1.Res_name  level1.Man_id  level2.Res_id  level2.Res_name  level2.Man_id
-------------  ---------------  -------------  -------------  ---------------  -------------
1              sam              3              NULL           NULL             NULL
2              Biju             4              NULL           NULL             NULL
3              adrian           NULL           1              sam              3
3              adrian           NULL           4              Helen            3
4              Helen            3              2              Biju             4
4              Helen            3              2              Biju             4
5              Micah            4              NULL           NULL             NULL

The WHERE clause filters it down to just this:

level1.Res_id  level1.Res_name  level1.Man_id  level2.Res_id  level2.Res_name  level2.Man_id
-------------  ---------------  -------------  -------------  ---------------  -------------
3              adrian           NULL           1              sam              3
3              adrian           NULL           4              Helen            3

And finally the SELECT clause returns only the names:

level1  level2
------  ------
adrian  sam
adrian  Helen

In the same manner, adding one more join, like this:

select  
 level1.res_name as level1,
 level2.res_name as level2,
 level3.res_name as level3

from resource as level1
left join resource as level2 on level1.res_id=level2.man_id
left join resource as level3 on level2.res_id=level3.man_id

where isnull(level1.man_id,0)=0

brings about the next level of subordinates:

this is the result of joining the third instance of the table to the results of the first join:

1.Res_id  1.Res_name  1.Man_id  2.Res_id  2.Res_name  2.Man_id  3.Res_id  3.Res_name  3.Man_id
--------  ----------  --------  --------  ----------  --------  --------  ----------  --------
1         sam         3         NULL      NULL        NULL      NULL      NULL        NULL
2         Biju        4         NULL      NULL        NULL      NULL      NULL        NULL
3         adrian      NULL      1         sam         3         NULL      NULL        NULL
3         adrian      NULL      4         Helen       3         2         Biju        4
3         adrian      NULL      4         Helen       3         5         Micah       4
4         Helen       3         2         Biju        4         NULL      NULL        NULL
4         Helen       3         5         Micah       4         NULL      NULL        NULL
5         Micah       4         NULL      NULL        NULL      NULL      NULL        NULL

(I've shortened the table aliases for convenience: 1 stands for level1, 2 for level2, and 3 for level3)

this is what remains after the WHERE filter:

1.Res_id  1.Res_name  1.Man_id  2.Res_id  2.Res_name  2.Man_id  3.Res_id  3.Res_name  3.Man_id
--------  ----------  --------  --------  ----------  --------  --------  ----------  --------
3         adrian      NULL      1         sam         3         NULL      NULL        NULL
3         adrian      NULL      4         Helen       3         2         Biju        4
3         adrian      NULL      4         Helen       3         5         Micah       4

and this is what SELECT extracts from the above and gives you back:

level1  level2  level3
------  ------  ------
adrian  sam     NULL
adrian  Helen   Biju
adrian  Helen   Micah

Same continues for the rest of the joins. With your example, however, there will be no more rows, as the hierarchy in the table doesn't go deeper than two levels. Consequently, the other two columns in Chris's original query, level4 and level5, will return NULLs:

level1  level2  level3  level4  level5
------  ------  ------  ------  ------
adrian  sam     NULL    NULL    NULL
adrian  Helen   Biju    NULL    NULL
adrian  Helen   Micah   NULL    NULL

If Chris's assumption that four joins should be enough for most cases does not work for your particular case (and I don't mean the simple example in your question, of course), you can throw in more self-joins using the same pattern.

Related Solutions

Sql-server – Outer Apply vs Left Join Performance

Can anyone tell how exactly apply works and how will it effect the performance in very large data

APPLY is a correlated join (called a LATERAL JOIN in some products and newer versions of the SQL Standard). Like any logical construction, it has no direct impact on performance. In principle, we should be able to write a query using any logically equivalent syntax, and the optimizer would transform our input into exactly the same physical execution plan.

Of course, this would require the optimizer to know every possible transformation, and to have the time to consider each one. This process might well take longer than the current age of the universe, so most commercial products do not take this approach. Therefore, query syntax can, and often does, have an impact on final performance, though it is difficult to make general statements about which is better and why.

The specific form of OUTER APPLY ( SELECT TOP ... ) is most likely to result in a correlated nested loops join in current versions of SQL Server, because the optimizer does not contain logic to transform this pattern to an equivalent JOIN. Correlated nested loops join may not perform well if the outer input is large, and the inner input is unindexed, or the pages needed are not already in memory. In addition, specific elements of the optimizer's cost model mean a correlated nested loops join is less likely than a semantically-identical JOIN to produce a parallel execution plan.

I was able to make same query with single left join and row_number()

This may or may not be better in the general case. You will need to performance test both alternatives with representative data. The LEFT JOIN and ROW_NUMBER certainly has potential to be more efficient, but it depends on the precise query plan shape chosen. The primary factors that affect the efficiency of this approach is the availability of an index to cover the columns needed, and to supply the order needed by the PARTITION BY and ORDER BY clauses. A second factor is the size of the table. An efficient and well-indexed APPLY can out-perform a ROW_NUMBER with optimal indexing if the query touches a relatively small portion of the table concerned. Testing is needed.

Sql-server – Create hierarchy of multiple levels where each node has a random number of children

^{(O.P.'s note: preferred solution is the 4th / final code block)}

XML seems to me to be the obvious choice of data structure to use here.

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)

select top(5 + abs(checksum(newid())) % 15)
  N1.N as '@Value',
  (
  select top(1 + abs(checksum(newid())) % 10)
    N2.N as '@Value',
    (
    select top(1 + abs(checksum(newid())) % 5)
      N3.N as '@Value'
    from N as N3
    where N2.N > 0
    for xml path('Level3'), type
    )
  from N as N2
  where N1.N > 0
  for xml path('Level2'), type
  )
from N as N1
for xml path('Level1'), root('Root');

The trick to make SQL Server use different values for top() for each node is to make the sub-queries correlated. N1.N > 0 and N2.N > 0.

Flatteing the XML:

declare @X xml;

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select @X  = (
             select top(5 + abs(checksum(newid())) % 15)
               N1.N as '@Value',
               (
               select top(1 + abs(checksum(newid())) % 10)
                 N2.N as '@Value',
                 (
                 select top(1 + abs(checksum(newid())) % 5)
                   N3.N as '@Value'
                 from N as N3
                 where N2.N > 0
                 for xml path('Level3'), type
                 )
               from N as N2
               where N1.N > 0
               for xml path('Level2'), type
               )
             from N as N1
             for xml path('Level1')
             );


select L1.X.value('@Value', 'varchar(10)')+'\'+
       L2.X.value('@Value', 'varchar(10)')+'\'+
       L3.X.value('@Value', 'varchar(10)')
from @X.nodes('/Level1') as L1(X)
  cross apply L1.X.nodes('Level2') as L2(X)
  cross apply L2.X.nodes('Level3') as L3(X);

And a version totally void of XML.

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select cast(N1.N as varchar(10))+'\'+
       cast(N2.N as varchar(10))+'\'+
       cast(N3.N as varchar(10))
from (
     select top(5 + abs(checksum(newid())) % 15)
       N.N
     from N
     ) as N1
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 10)
       N.N
     from N
     where N1.N > 0
     ) as N2
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 5)
       N.N
     from N
     where N2.N > 0
     ) as N3;

Correlation N1.N > 0 and N2.N > 0 is still important.

A version using a table with 20 names to be used instead of just integers.

declare @Elements table
(
  Name nvarchar(50) not null
);

insert into @Elements(Name)
select top(20) C.name 
from sys.columns as C
group by C.name;

select N1.Name + N'\' + N2.Name + N'\' + N3.Name
from (
     select top(5 + abs(checksum(newid())) % 15)
       E.Name
     from @Elements as E
     ) as N1
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 10)
       E.Name
     from @Elements as E
     where N1.Name > ''
     ) as N2
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 5)
       E.Name
     from @Elements as E
     where N2.Name > ''
     ) as N3;

Best Answer

Related Solutions

Sql-server – Outer Apply vs Left Join Performance

Sql-server – Create hierarchy of multiple levels where each node has a random number of children

Related Question