SQL Server – Find Missing Folder Paths in Split Hierarchies

hierarchyperformancequery-performancesql serversql-server-2012

I have a table which contains folders' paths. This table contains four columns:

  • DirID – The folder's ID.
  • BaseDirID – The ID of the first folder in the hierarchy. So all the folders (paths) from the same hierarchy share the same value in this column.
  • DirLevel – The depth of the folder.
  • DisplayPath – The folder's path.

I need to find all the "gaps" between those folders in the hierarchy.

Sample data for example:

DirID BaseDirID DirLevel DisplayPath
1         1        1     'A'
2         1        3     'A\B\C'
3         1        5     'A\B\C\D\E'
4         1        3     'A\B\F'
5         1        5     'A\B\F\G\H'
6         2        1     'U'
7         2        3     'U\V\W'
8         2        5     'U\V\W\X\Y'
9         2        3     'U\V\M'
10        2        5     'U\V\M\L\O'

So we need to find the following data:

BaseDirID   DisplayPath
1           'A\B'
1           'A\B\C\D'
1           'A\B\F\G'
2           'U\V'
2           'U\V\W\X'
2           'U\V\M\L'

Comments:

  1. This table contains more than 250,000 records of folders, so we seek for the most efficient way to do so, otherwise, the script will be stuck for a long time, the time we don't have.
  2. I don't have the list of all folders. What I have are the "root" folders and the "leafs" folders which I need to find the "gaps" between them in the hierarchy.
  3. The table can contain more than one hierarchy and we need to find the "gaps" in all of the hierarchies.
  4. Each of the hierarchies can split, as you can see in the sample data the first hierarchy splits to two folders paths from the 'A\B' folder:
    'A\B\C' and 'A\B\F'.
    And the second hierarchy splits to two folders paths from the 'U\V' folder: 'U\V\W' and 'U\V\M'.
    And we need to find all the "gaps" even in such cases when the hierarchies split.
  5. We can make any changes to the table – add pk, indexes etc…
  6. The SQL Server version is 2012 SP3.
  7. The real folder names could be anything – one character or more.
  8. The results must only contain the "gaps".
  9. All missing intermediate folders are required.

Extended sample data and expected results in this dbfiddle.

This question is a continuation of the Stack Overflow question Find missing hierarchy Folders (Paths) in a table. Our question includes also the 4th comment which appears in bold.

I saw that there is a new type called hierarchyid (starting from SQL Server 2008), which I thought might help us. What do you think?

Best Answer

I'd try something like this:

  • First, create a new table and copy the existing data there.
  • Then iterate, and in each iteration, remove the last node of each filepath (e.g.: 'A\B\C\D\E' becoming 'A\B\C\D') and add these new filepaths in the table, if they aren't already there.
  • Stop when an iteration doesn't produce new rows.

The initialization:

-- step 0
CREATE TABLE filepaths
  ( BaseDirID int NOT NULL,
    DirLevel int NOT NULL,
    DisplayPath varchar(1000) NOT NULL,   -- adjust the size according to your data,
    ReverseDisplayPath varchar(1000) NOT NULL,
    Iteration int NOT NULL,
    PRIMARY KEY (Iteration, ReverseDisplayPath),
    UNIQUE (ReverseDisplayPath)
  ) ;

INSERT INTO filepaths
  ( BaseDirID, DirLevel, DisplayPath, ReverseDisplayPath, Iteration )
SELECT
    BaseDirID, DirLevel, DisplayPath, reverse(DisplayPath), 0
FROM 
    existing_table ;

And the iterations:

DECLARE @new_items bigint ;
DECLARE @iter int ;
SET @iter = 0 ;

-- repeat
repeat:     
    INSERT INTO filepaths
        (BaseDirID, DirLevel, DisplayPath, ReverseDisplayPath, Iteration)
    SELECT DISTINCT
        f.BaseDirID, f.DirLevel - 1, reverse(r.rdp), r.rdp, @iter + 1
    FROM 
        filepaths AS f
      CROSS APPLY
        ( SELECT substring(f.ReverseDisplayPath, 
                           1 + charindex('\', f.ReverseDisplayPath), 
                           1000) AS rdp
        ) AS r
    WHERE 
          f.Iteration = @iter
      AND f.DirLevel > 1 
      AND NOT EXISTS
          ( SELECT *
            FROM filepaths AS ex
            WHERE ex.ReverseDisplayPath = r.rdp
          ) ;
    SET @new_items = @@ROWCOUNT ;

    SET @iter = @iter + 1 ;
    -- until new_items = 0
    IF (@new_items > 0) GOTO repeat; 

After the process is finished, you can get only the missing filepaths with a simple query:

SELECT DisplayPath FROM filepaths 
WHERE Iteration > 0 
ORDER BY DisplayPath ;

Tested in dbfiddle.uk.

The reverse strings are not strictly needed but they simplify the splitting of the strings. You could go without them (and a bit more complex code). You could also have only the reverse paths stored in the new filepaths table and when the iterations are complete, reverse them again into the target table (the existing one I guess or wherever you want them to go).

The whole thing could also be done with a single pass - using a recursive CTE - but I don't know if that would be more or less efficient.