SQL Server – How to Take a Slice of a Database for Specific Rows

Tags: backup, export, scripting, sql-server, tools

For audit purposes, I'm trying to obtain a copy/export of a DB that includes only the rows from each table that are relevant to a small collection of rows in some table. That means every row of every table that these rows point to, every row of every table that points to these rows, and so forth. (In algebra terms this is the transitive-symmetric-reflexive closure of those rows with respect to the foreign key relation.)

I'm wondering if there exists some kind of tool to do this, or if this is scriptable somehow. I have ~400 tables with a tight network of foreign keys so I'd rather not do this by hand.

Best Answer

You could certainly script it with a nested stored procedure:

  1. It would be called as something like EXEC recursive_shaped_query @StartTable='SomeTable', @KeyColumn='Id', @ValuesSQL='SELECT Id FROM (VALUES (1),(6),(4576),(42)) AS Ids(Id)'
  2. It then builds and runs, as ad-hoc SQL, SELECT * FROM SomeTable WHERE Id IN (SELECT Id FROM (VALUES (1),(6),(4576),(42)) AS Ids(Id))
  3. It then cursors through sys.foreign_keys to find any tables that reference this table and…
  4. … for each such table found calls itself like EXEC recursive_shaped_query @StartTable='ReferencingTable', @KeyColumn='SomeTableId', @ValuesSQL='SELECT Id FROM SomeTable WHERE Id IN (SELECT Id FROM (VALUES (1),(6),(4576),(42)) AS Ids(Id))' (here SomeTableId stands for whichever column of ReferencingTable the foreign key uses, so the filter is applied to the referencing column rather than to that table's own Id)
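
As a rough sketch of the lookup in step 3, the query below lists the foreign keys that reference a given table together with their column pairs, using the sys.foreign_keys and sys.foreign_key_columns catalog views. The table name dbo.SomeTable and the output column aliases are placeholders for illustration only; a real procedure would feed these rows into its cursor.

```sql
-- Hypothetical starting table; replace with a real schema-qualified name.
DECLARE @StartTable sysname = N'dbo.SomeTable';

-- One row per column of every foreign key that references @StartTable.
SELECT
    fk.name                                                      AS ForeignKeyName,
    OBJECT_SCHEMA_NAME(fk.parent_object_id)                      AS ReferencingSchema,
    OBJECT_NAME(fk.parent_object_id)                             AS ReferencingTable,
    COL_NAME(fkc.parent_object_id, fkc.parent_column_id)         AS ReferencingColumn,
    COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) AS ReferencedColumn,
    fkc.constraint_column_id                                     AS ColumnOrdinal
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc
    ON fkc.constraint_object_id = fk.object_id
WHERE fk.referenced_object_id = OBJECT_ID(@StartTable)
ORDER BY fk.name, fkc.constraint_column_id;
```

A foreign key that spans several columns shows up as several rows sharing one ForeignKeyName, which is how the multi-column case can be detected.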

Of course there would be a number of complications to consider:

  • You will need to decide what to do if your data structure allows for circular links, like a table containing state machine graphs, or places where two entities might refer to each other. Otherwise the recursion never terminates on its own; in practice it stops only when SQL Server hits its stored procedure nesting limit of 32 levels and raises an error.
  • You may need to deal with foreign keys acting on multiple fields instead of just one.
  • The SELECT * FROM t1 WHERE Id IN (SELECT Id FROM t2 WHERE Id IN (SELECT Id FROM t3 WHERE ... )) nesting may get very long, and the resulting queries may take an age to run due to table scans if useful indexes are not in place.
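
To illustrate the multi-column point, here is a minimal sketch using made-up table variables (@Parent, @Child and @Seed are all hypothetical, and table variables cannot carry real foreign keys, so the relationship is only implied). It matches child rows to the selected parent rows with a correlated EXISTS, since a single-column IN cannot express a composite key.

```sql
-- Hypothetical parent table with a two-column key.
DECLARE @Parent TABLE (KeyA int NOT NULL, KeyB int NOT NULL, PRIMARY KEY (KeyA, KeyB));
-- Hypothetical child table that points at @Parent through two columns.
DECLARE @Child TABLE (Id int NOT NULL PRIMARY KEY, ParentKeyA int NOT NULL, ParentKeyB int NOT NULL);

INSERT INTO @Parent (KeyA, KeyB) VALUES (1, 10), (1, 20), (2, 10), (2, 20);
INSERT INTO @Child (Id, ParentKeyA, ParentKeyB)
VALUES (100, 1, 10), (101, 1, 20), (102, 2, 10), (103, 2, 20);

-- Seed rows: the parent keys the slice starts from.
DECLARE @Seed TABLE (KeyA int NOT NULL, KeyB int NOT NULL);
INSERT INTO @Seed (KeyA, KeyB) VALUES (1, 10), (2, 20);

-- Both key columns must match the same seed row.
SELECT c.Id, c.ParentKeyA, c.ParentKeyB
FROM @Child AS c
WHERE EXISTS (
    SELECT 1
    FROM @Seed AS s
    WHERE s.KeyA = c.ParentKeyA
      AND s.KeyB = c.ParentKeyB
)
ORDER BY c.Id;
-- Returns the rows with Id 100 and 103.
```

Filtering each column separately (ParentKeyA IN (1, 2) AND ParentKeyB IN (10, 20)) would return all four child rows here, which is why the pairs have to be compared together.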

Other things to note:

  • You might find using joins instead of sub-selects produces more efficient queries (though I think the query planner will do a decent job of optimising the two syntax options to be equivalent).
  • You might prefer the procedure just emits the ad-hoc SQL needed instead of running it, so you can run it later (and rerun it again elsewhere).
  • If just emitting the ad-hoc SQL and not running it, a recursive CTE over sys.foreign_keys may be noticeably cleaner than a nested stored procedure. In fact, doing that might be better anyway: generate the SQL in one go, then run it in one go, perhaps explicitly dropping the data into temp tables and then ending the current transaction to release locks before pushing the data out to the reader.
  • This is all of course assuming that you have foreign keys defined everywhere they should be. In too many databases out there some or all of these constraints are missing.
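
The recursive CTE idea could look something like the sketch below. It is not a finished generator: it only walks the foreign key graph outward from a hypothetical dbo.SomeTable to the tables that reference it, directly or indirectly, and reports each reachable table with its minimum distance. The visited path string is one assumed way of guarding against circular links, and the depth cap of 10 is an arbitrary safety limit, because a tightly connected schema can produce a very large number of distinct paths.

```sql
-- Hypothetical starting table; replace with a real schema-qualified name.
DECLARE @StartTable sysname = N'dbo.SomeTable';

WITH fk_walk AS (
    -- Anchor: the starting table itself.
    SELECT
        OBJECT_ID(@StartTable) AS table_id,
        0                      AS depth,
        CAST('/' + CAST(OBJECT_ID(@StartTable) AS varchar(11)) + '/' AS varchar(4000)) AS visited
    UNION ALL
    -- Recursive step: tables with a foreign key that references the current table.
    SELECT
        fk.parent_object_id,
        w.depth + 1,
        CAST(w.visited + CAST(fk.parent_object_id AS varchar(11)) + '/' AS varchar(4000))
    FROM fk_walk AS w
    JOIN sys.foreign_keys AS fk
        ON fk.referenced_object_id = w.table_id
    WHERE w.depth < 10  -- arbitrary cap to keep the walk bounded
      -- Cycle guard: skip tables already on this path (covers self-references too).
      AND w.visited NOT LIKE '%/' + CAST(fk.parent_object_id AS varchar(11)) + '/%'
)
SELECT
    OBJECT_SCHEMA_NAME(table_id) AS SchemaName,
    OBJECT_NAME(table_id)        AS TableName,
    MIN(depth)                   AS MinDepth
FROM fk_walk
GROUP BY table_id
ORDER BY MinDepth, SchemaName, TableName;
```

To cover the other direction (tables that the current table points to), a second recursive member joining on fk.parent_object_id = w.table_id and selecting fk.referenced_object_id would be needed. The resulting table list, combined with the column pairs from sys.foreign_key_columns, is what the SQL generation step would build its SELECT statements from.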

For our systems in [Day Job] I have a set of such queries manually created to pull out nested data for given entities as these procedures are useful for diagnostics, but I've never needed nearly enough of them to warrant trying to create a generator or other automated solution. I've thought about it as a play project, but not actually done it.