Sql-server – Import XML data while creating IDs with relationships

importsql serverxml

I need to parse XML from a file into tables while retaining their relationships through custom-generated ID values.

For example, if I had the following XML:

<root>
   <construction>
      <constructionName>randomname1</constructionName>
      <project>
         <projectname>another randomname</projectname>
         <businesspartners>
            <partnername>bilbo bagginses</partnername>
         </businesspartners>
         <employees>
            <employee>
               <empname>frodo</empname>
               <empaddress>etc...</empaddress>
            </employee>
         </employees>
      </project>
      <info>
         <randElement></randElement>
      </info>
      <constructionType>houses</constructiontype>
   </construction>
   <construction>
      <...(etc, same as above, times n^10)/>
   </construction>
</root>

From that, I need to generate data to the following tables:

CONSTRUCTION (CONSTRUCTION_ID INT PRIMARY KEY
   , CONSTRUCTIONNAME VARCHAR..
   , CONSTRUCTIONTYPE VARCHAR.. )

PROJECT (PROJECT_ID INT PRIMARY KEY
   , CONSTRUCTION_ID INT FOREIGN KEY REFERENCES CONSTRUCTION
   , PROJECTNAME VARCHAR.., )


BUSINESSPARTNERS (BUSINESSPARTNERS_ID INT PRIMARY KEY
   , PROJECT_ID INT FOREIGN KEY REFERENCES PROJECT
   , PARTNERNAME VARCHAR..)

etc...

Basically, the idea is to construct tables with their references intact to represent the XML. The table structures already exist and cannot be changed to accommodate this script. This is simply a matter of being able to do the same kind of XML parsing and then adding the data to the tables while generating the correct referencing ID values, as we did before with integration. Only doing this exclusively with SQL Server, and without using SSIS.

Now, the actual scenario and the file in question is rather massive, so I do by no means expect a full answer. Just tips on where to start looking. I'm quite inexperienced with dealing with XML.

For now, my preferred solution would be just importing the data with OPENROWSET, and then using dynamic SQL with OPENXML to parse the document into the tables by basically looping the elements and their sub-elements one at a time. But that seems a lot more trouble than some other, smarter way.

How are the ID values generated?

That's kind of part of the question. For now, they aren't. The idea is that each CONSTRUCTION element would be assigned an ID increasing from 1. Then all the children of CONSTRUCTION would then refer to the same ID assigned to the parent, and so on. Basically, it's just about dividing the data in the XML into multiple tables while keeping the referential integrity intact.

I would have no problem using an identity column as a primary key, so long as the relations aren't mixed up as a result. I don't know how to go about doing this, so I assumed one would have to somehow manually define the logic for creating the ID's, as opposed to identity?

Best Answer

You can use a variant of the technique described in Using merge..output to get mapping between source.id and target.id and by Adam Machanic in Dr. OUTPUT or: How I Learned to Stop Worrying and Love the MERGE.

You use merge and capture the generated ID with the XML fragment that belongs to that ID in a table variable and then use that table variable when adding rows to the child tables.

declare @C table
(
  CONSTRUCTION_ID int primary key,
  PROJECT xml
);

merge CONSTRUCTION as T
using (
      select T.X.value('(constructionName/text())[1]', 'varchar(30)') as CONSTRUCTIONNAME,
             T.X.value('(constructionType/text())[1]', 'varchar(30)') as CONSTRUCTIONTYPE,
             T.X.query('project') as PROJECT
      from @xml.nodes('/root/construction') as T(X)
      ) as S
on 0 = 1
when not matched by target then
  insert (CONSTRUCTIONNAME, CONSTRUCTIONTYPE) 
  values (S.CONSTRUCTIONNAME, S.CONSTRUCTIONTYPE)
output inserted.CONSTRUCTION_ID, S.PROJECT into @C;

declare @P table
(
  PROJECT_ID int primary key,
  BUSINESSPARTNERS XML
);

merge PROJECT as T
using (
      select C.CONSTRUCTION_ID,
             T.X.value('(projectname/text())[1]', 'varchar(30)') as PROJECTNAME,
             T.X.query('businesspartners') as BUSINESSPARTNERS
      from @C as C
        cross apply C.PROJECT.nodes('/project') as T(X)
      ) as S
on 0 = 1
when not matched by target then
  insert (CONSTRUCTION_ID, PROJECTNAME)
  values(S.CONSTRUCTION_ID, S.PROJECTNAME)
output inserted.PROJECT_ID, S.BUSINESSPARTNERS into @P;

insert into BUSINESSPARTNERS(PROJECT_ID, PARTNERNAME)
select P.PROJECT_ID,
       T.X.value('text()[1]', 'varchar(30)')
from @P as P
  cross apply P.BUSINESSPARTNERS.nodes('/businesspartners/partnername') as T(X);

SQL Fiddle