SQL Server – Create Hierarchy with Random Number of Children

hierarchyrandomsql servert-sql

I need to create some test data that involves a hierarchy. I could make it easy and do a couple of CROSS JOINs, but that would give me a structure that is completely uniform / without any variation. That not only seems dull, but lack of variation in test data sometimes masks problems that would otherwise be found. So, I am wanting to generate a non-uniform hierarchy that follows these rules:

3 levels deep
- Level 1 is randomly 5 – 20 nodes
- Level 2 is 1 – 10 nodes, random per each node of Level 1
- Level 3 is 1 – 5 nodes, random per each node of Level 2
All branches will be 3 levels deep. Uniformity in depth is ok at this point.
There can be overlap in names of child nodes on any given level (i.e. names of child nodes do not need to be unique across all nodes on the same level).
The term "random" is defined here as being pseudo-random, not uniquely random. This needs to be mentioned since the term "random" is often used to mean "random ordering of a given set that does not produce duplicates". I accept that random = random and if the number of children per each node of Level 1 is only 4, 7, and 8, even across 20 nodes on Level 1 that has a potential spread of 1 – 10 children per each of those nodes, then that is fine, because that is what random is.
Even though this can be done quite easily with nested WHILE loops, the preference is to find a set-based approach. Generally speaking, generating test data does not have the requirements for efficiency that Production code would have, but shooting for a set-based approach will likely be more educational and help in the future with finding set-based approaches to problems. So WHILE loops are not ruled-out, but can only be used if no set-based approach is possible.
Set-based = ideally a single query, regardless of CTEs, APPLYs, etc. So using an existing or inline numbers table is fine. Using a WHILE / CURSOR / procedural approach will not work. I suppose staging portions of the data into temp tables or table variables fine, just so long as the operations are all set-based, no loops. However, that being said, a single-query approach will probably be favored over multiple queries, unless it can be shown that the multi-query approach is actually better. Please also keep in mind that what constitutes "better" is typically subjective ;-). Please also keep in mind that the usage of "typically" in the prior sentence is also subjective.
Any version and edition of SQL Server (2005 and newer, I suppose) will do.
Only pure T-SQL: none of that silly SQLCLR stuff!! At least in terms of generating the data. Creating the directories and files will be done using SQLCLR. But here I am just focusing on generating the values of what to create.
T-SQL Multi-statement TVF are considered procedural, not set-based, even though on the outside they mask the procedural approach in a set. There are times when that is absolutely appropriate. This is not one of those time. Along those same lines, T-SQL Scalar functions are also not allowed, not only because they are also procedural, but the Query Optimizer sometimes caches their value and repeats it such that the output is not as expected.
T-SQL Inline TVFs (a.k.a. iTVFs) are okey-dokey as they are set-based, and effectively the same as using [ CROSS | OUTER ] APPLY, which was stated above as being ok.
Repeated executions of the query(ies) should produce mostly different result from the prior run.
Clarification Update 1: The final result set should be expressed as having one row for each distinct node of Level3, having the full path starting at Level1. This means that the Level1 and Level2 values will necessarily repeat across one or more rows, except in cases of there being only a single Level2 node containing only a single Level3 node.
Clarification Update 2: There is a very strong preference for each node having a name or label, and not just a number. This will allow for the resulting test data being more meaningful and realistic.

I am not sure if this additional info matters, but just in case it helps to have some context, the test data relates to my answer on this question:

Import XML files into SQL Server 2012

While not relevant at this point, the end goal of generating this hierarchy is to create a directory structure to test recursive file system methods. Levels 1 and 2 will be directories, and Level 3 will end up being the file name. I have searched around (both here and via the Googles) and have only found one reference to generating a random hierarchy:

Linux: create random directory/file hierarchy

That question (on StackOverflow) is actually quite close in terms of desired result since that also seeks to create a directory structure for testing. But that question (and the answers) are focused on Linux/Unix shell scripting and not so much the set-based world that we live in.

Now, I know how to generate random data, and am doing so already to create the contents of the files so that they can also show variations. The tricky part here is that the number of elements within each set is random, not a particular field. And, the number of elements within each node needs to be random from other nodes on that same levels.

Example Hierarchy

     Level 1
              Level 3
|---- A
|     |-- 1
|     |   |--- I
|     |
|     |-- 2
|         |--- III
|         |--- VI
|         |--- VII
|         |--- IX
|
|---- B
|     |-- 87
|         |--- AAA
|         |--- DDD
|
|---- C
      |-- ASDF
      |   |--- 11
      |   |--- 22
      |   |--- 33
      |
      |-- QWERTY
      |   |--- beft
      |
      |-- ROYGBP
          |--- Poi
          |--- Moi
          |--- Soy
          |--- Joy
          |--- Roy

Example Result Set Describing the Hierarchy Above

Level 1    Level 2    Level 3
A          1          I
A          2          III
A          2          VI
A          2          VII
A          2          IX
B          87         AAA
B          87         DDD
C          ASDF       11
C          ASDF       22
C          ASDF       33
C          QWERTY     beft
C          ROYGBP     Poi
C          ROYGBP     Moi
C          ROYGBP     Soy
C          ROYGBP     Joy
C          ROYGBP     Roy

Best Answer

^{(O.P.'s note: preferred solution is the 4th / final code block)}

XML seems to me to be the obvious choice of data structure to use here.

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)

select top(5 + abs(checksum(newid())) % 15)
  N1.N as '@Value',
  (
  select top(1 + abs(checksum(newid())) % 10)
    N2.N as '@Value',
    (
    select top(1 + abs(checksum(newid())) % 5)
      N3.N as '@Value'
    from N as N3
    where N2.N > 0
    for xml path('Level3'), type
    )
  from N as N2
  where N1.N > 0
  for xml path('Level2'), type
  )
from N as N1
for xml path('Level1'), root('Root');

The trick to make SQL Server use different values for top() for each node is to make the sub-queries correlated. N1.N > 0 and N2.N > 0.

Flatteing the XML:

declare @X xml;

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select @X  = (
             select top(5 + abs(checksum(newid())) % 15)
               N1.N as '@Value',
               (
               select top(1 + abs(checksum(newid())) % 10)
                 N2.N as '@Value',
                 (
                 select top(1 + abs(checksum(newid())) % 5)
                   N3.N as '@Value'
                 from N as N3
                 where N2.N > 0
                 for xml path('Level3'), type
                 )
               from N as N2
               where N1.N > 0
               for xml path('Level2'), type
               )
             from N as N1
             for xml path('Level1')
             );


select L1.X.value('@Value', 'varchar(10)')+'\'+
       L2.X.value('@Value', 'varchar(10)')+'\'+
       L3.X.value('@Value', 'varchar(10)')
from @X.nodes('/Level1') as L1(X)
  cross apply L1.X.nodes('Level2') as L2(X)
  cross apply L2.X.nodes('Level3') as L3(X);

And a version totally void of XML.

with N as
(
  select T.N
  from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
              (12),(13),(14),(15),(16),(17),(18),(19),(20)) as T(N)
)
select cast(N1.N as varchar(10))+'\'+
       cast(N2.N as varchar(10))+'\'+
       cast(N3.N as varchar(10))
from (
     select top(5 + abs(checksum(newid())) % 15)
       N.N
     from N
     ) as N1
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 10)
       N.N
     from N
     where N1.N > 0
     ) as N2
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 5)
       N.N
     from N
     where N2.N > 0
     ) as N3;

Correlation N1.N > 0 and N2.N > 0 is still important.

A version using a table with 20 names to be used instead of just integers.

declare @Elements table
(
  Name nvarchar(50) not null
);

insert into @Elements(Name)
select top(20) C.name 
from sys.columns as C
group by C.name;

select N1.Name + N'\' + N2.Name + N'\' + N3.Name
from (
     select top(5 + abs(checksum(newid())) % 15)
       E.Name
     from @Elements as E
     ) as N1
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 10)
       E.Name
     from @Elements as E
     where N1.Name > ''
     ) as N2
  cross apply
     (
     select top(1 + abs(checksum(newid())) % 5)
       E.Name
     from @Elements as E
     where N2.Name > ''
     ) as N3;

Related Solutions

Modelling Hierarchical attributes

You could formalize the approach that you're already using and make it more efficient to both maintain and query. Consider an ERD like so:

ERD

Note that all of your hierarchy is flattened into a single set of columns. There is an involuted relationship so that each child entity in the hierarchy can point to it's immediate parent. I've also included left and right visitation numbers which would really help you with your price lookups. Lastly there is a Level column to store the rank within the hierarchy.

If you're not familiar with visitation numbers please have a look at my answer to this question, where I go into quite a bit of detail.

So why is this better than what you're doing now?

You only track each item in the hierarchy once. Instead of having 4001 show up in the channel column many, many times, you just have one record with 4001.
The structure is flexible and allows for insertion of new levels at a later time. It also allows for uneven trees, say for example ZONE 3001 is subdivided into three SUBZONEs and that cities are under the subzones there (but go direct to zone outside of 3001).
All of your prices are in exactly one place (which doesn't need any reading logic that goes outside of normal SQL queries. Instead of having to have logic that joins a price to different Entity tables depending on what range the EntityID is in, you have one consistent price retrieval query.
You don't need coded meaning in your identifiers so you won't run into trouble when you get too many entities at any one level, thus breaking your coding scheme.
The price retrieval query is dead simple:


-- Look up the left and right numbers for the store in question:
declare @StoreLeft int
declare @StoreRight int
select @StoreLeft = H.left, @StoreRight = H.right
from HIERARCHY H where H.ID = @DesiredStoreId

-- Look up the lowest level (most granular) price for this store:
select TOP 1
  P.price_amt
, P.currency
from PRICE P
  inner join HIERARCHY H
    on H.Left <= @StoreLeft and H.right >= @StoreRight
  inner join ARTICLE A
    on P.article_id = A.id
where A.id = @TheArticleInQuestion
order by H.Level ASC   -- ASC or DESC depends on how you count levels.

Reverse connect by prior level value for arbitrarily-deep hierarchy

You can't ever query for ROWNUM = 3; ROWNUM 1 gets filtered out, so the next record is now ROWNUM = 1, and so on.

I suspect being worried about controlling the computational depth is premature optimization. Unless you have millions of options being returned for a single user, pulling the highest level out of the data set - or even pulling the highest level for each dataset should be pretty simple. It's not as if you can't fully explore the hierarchy, so the additional work in the SQL engine of prioritizing/ranking each window of rows is pretty light on top of that.

So using aggregate/window functions seems like a natural choice here.

Your SQLFiddle doesn't return the dataset expected, but I've modified the SQL below to show how you can prioritize them, assuming I understand the requirements you're looking for.

Analytics rock.

select menu_level, menu_name, menu_id
from
(
select menu_level, menu_name, menu_id,
       row_number() over (partition by menu_name order by menu_level desc) as menu_priority
from
(
SELECT
    LEVEL AS menu_level,
    t.name AS menu_name,
    t.id AS menu_id
  FROM
    jhs_menu_items t, (
      SELECT
        jmi.id
      FROM
        jhs_users ju
      JOIN jhs_user_role_grants jurg ON
        ju.id = jurg.usr_id
      LEFT OUTER JOIN user_menu_preferences ump ON
        ju.id = ump.jhs_usr_id
      LEFT OUTER JOIN default_menu_preferences dmp ON
        jurg.rle_id = dmp.jhs_rle_id
      JOIN jhs_menu_items jmi ON
        -- Retrieve the user's preferred menu item, failing to the default
        -- if no preference is set.
        jmi.id = coalesce(
          ump.jhs_menu_items_id,
          dmp.jhs_menu_items_id
        )
      WHERE
        ju.username = 'adminname' 
--      AND ROWNUM = 1
      ORDER BY
        dmp.role_priority_sort
    ) menu_preference

  -- Derive the menu hierarchy starting at the user's preference, going back to 
  -- the root menu item.
  START WITH t.id = menu_preference.id
  CONNECT BY PRIOR t.mim_id = t.id
  )
)
 where menu_priority = 1
 order by menu_name

MENU_LEVEL  MENU_NAME   MENU_ID
        3   MenuTab1    1
        2   MenuTab2    2
        1   MenuTab3    3

Best Answer

Related Solutions

Modelling Hierarchical attributes

Reverse connect by prior level value for arbitrarily-deep hierarchy

Related Question