Sql-server – XQuery in SQL Server to convert XML column data to relational data table

sql-server-2008 xml xquery

I previously asked a question about an error I was receiving. You don't really need it to understand this question, but it's here for reference:

XML/SQL Server 2008 Error: XQuery…Cannot implicitly atomize or apply 'fn:data()' to complex content elements

The previous XML is a little complex and would probably benefit from a transformation, so I applied an XSLT template to get the structure below and changed the tags a little so it's more understandable. I've also restructured the table I'm importing to, for maintainability. I imported the transformed XML file into a SQL Server table, xTable, with column xData, like so (only one row, but I suppose you could import more than one and merge them all with David Browne's answer):

ID    xData
1     <MyXMLFile><Sample><Location>....

The <Sample> node can be repeated up to 1 million times, but for this illustration I only have 2. There are 22 child nodes for each <Sample>: one <SampleID> node and 21 <Location> nodes (I've only shown 2 <Location> nodes to keep things short). There are 3 child nodes for each <Location> node: one <LocationName> node and two <Foo> nodes, designated <Foo1> and <Foo2>.

<?xml version="1.0" encoding="UTF-16"?>
<MyXMLFile>
    <!--There CAN BE up to 1 million <Sample> nodes-->
    <Sample>
        <!--There ARE EXACTLY 22 child nodes for each <Sample> parent node, one <SampleID> and 21 <Location>-->
        <SampleID>0000001A</SampleID>
        <!--There ARE EXACTLY 3 child nodes for each <Location> parent node, one <LocationName> and two <Foo>-->
        <Location>
            <LocationName>Jeff</LocationName>
            <Foo1>10</Foo1>
            <Foo2>11</Foo2>
        </Location>
        <Location>
            <LocationName>Jenn</LocationName>
            <Foo1>11</Foo1>
            <Foo2>12</Foo2>
        </Location>
    </Sample>
    <Sample>
        <SampleID>0000002A</SampleID>
        <Location>
            <LocationName>Greg</LocationName>
            <Foo1>13</Foo1>
            <Foo2>14</Foo2>
        </Location>
        <Location>
            <LocationName>Anne</LocationName>
            <Foo1>14</Foo1>
            <Foo2>16</Foo2>
        </Location>
    </Sample>
</MyXMLFile>

I want to convert the xData column from xTable and put it into this table (ID column for illustration only):

ID      SampleID    LocationName    Foo1   Foo2
1       00000001    Jeff            10     11     
2       00000001    Jenn            11     12     
…       00000001    …               …      …            
22      00000001    …               …      …     
23      00000002    Greg            13     14    
24      00000002    Anne            17     18
…       00000002    …               …      …
44      00000002    …               …      …

At the moment, I'm just trying to SELECT the xData column from xTable and will edit the query later to insert the data. So my first query, just to show that <SampleID> does get selected:

Query 1

SELECT  a.b.query('SampleID').value('.', 'varchar(20)') AS SampleID

FROM xTable

CROSS APPLY xData.nodes('MyXMLFile/Sample') as a(b)

The output looks good:

ID      SampleID
1       00000001
2       00000002

So, I added to the query:

Query 2

SELECT  a.b.query('SampleID').value('.', 'varchar(20)') AS SampleID,
        a.b.query('LocationName').value('.', 'varchar(10)') AS LocationName,
        a.b.query('Foo1').value('.', 'varchar(6)') AS Foo1,
        a.b.query('Foo2').value('.', 'varchar(6)') AS Foo2

FROM xTable
CROSS APPLY xData.nodes('MyXMLFile/Sample/SampleID/../Location') as a(b)

For this output, no data gets selected for <SampleID>. This is not surprising to me, as the XPath selects only the <Location> nodes, so the query can return their children <LocationName>, <Foo1> and <Foo2> but not <SampleID>.

ID      SampleID    LocationName    Foo1   Foo2
1                   Jeff            10     11     
2                   Jenn            11     12     
…                   …               …      …            
22                  …               …      …     
23                  Greg            13     14    
24                  Anne            17     18     
…                   …               …      …
44                  …               …      …
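The empty SampleID column can be reproduced on a small variable. The following is a minimal sketch, assuming the same element names as above; the variable @x and its shortened contents are illustrative only. With the context node bound to <Location>, the relative path SampleID matches nothing, while stepping up with the parent axis (..) reaches the sibling <SampleID>.

```sql
-- Illustrative document; @x stands in for a single xData value.
DECLARE @x xml = N'<MyXMLFile>
  <Sample>
    <SampleID>0000001A</SampleID>
    <Location><LocationName>Jeff</LocationName></Location>
    <Location><LocationName>Jenn</LocationName></Location>
  </Sample>
</MyXMLFile>';

SELECT
    -- SampleID is not a child of Location, so this path matches nothing
    -- and value() returns NULL.
    loc.n.value('(SampleID/text())[1]', 'varchar(20)')     AS ChildPath,
    -- Stepping up to the parent Sample first does reach SampleID.
    loc.n.value('(../SampleID/text())[1]', 'varchar(20)')  AS ParentPath,
    loc.n.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName
FROM @x.nodes('/MyXMLFile/Sample/Location') AS loc(n);
```

Note that value() yields NULL for the unmatched path here, whereas Query 2 wraps the path in query() first and therefore gets an empty string. Parent-axis navigation may also perform worse than nesting nodes() calls on large documents, so it is worth measuring before relying on it.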

So then I tried this:

Query 3

SELECT  a.b.query('SampleID').value('.', 'varchar(20)') AS SampleID,
        c.d.query('LocationName').value('.', 'varchar(10)') AS LocationName,
        c.d.query('Foo1').value('.', 'varchar(6)') AS Foo1,
        c.d.query('Foo2').value('.', 'varchar(6)') AS Foo2
FROM xTable
CROSS APPLY xData.nodes('MyXMLFile/Sample/SampleID') as a(b)
CROSS APPLY xData.nodes('MyXMLFile/Sample/SampleID/../Location') as c(d)

The output is a little better, but the rows are duplicated. There should only be 44 rows, but there are 88:

ID      SampleID    LocationName    Foo1   Foo2
1       00000001    Jeff            10     11     
2       00000001    Jenn            11     12     
…       00000001    …               …      …            
42      00000001    …               …      …     
43      00000001    …               …      …
44      00000001    …               …      …
45      00000002    Greg            13     14    
46      00000002    Anne            17     18
…           …       …               …      …
88      00000002    …               …      …
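The duplication is consistent with the two nodes() calls being unrelated: both start from the document root, so CROSS APPLY pairs every <SampleID> with every <Location> in the document. The following is a small sketch of that effect, assuming a cut-down document (the variable @x, the IDs A and B, and the empty <Location/> elements are placeholders, not the real data).

```sql
-- Illustrative document: 2 Sample nodes with 2 Location nodes each.
DECLARE @x xml = N'<MyXMLFile>
  <Sample><SampleID>A</SampleID><Location/><Location/></Sample>
  <Sample><SampleID>B</SampleID><Location/><Location/></Sample>
</MyXMLFile>';

-- Both nodes() calls start from the document root, so they are independent:
-- 2 SampleID nodes x 4 Location nodes = 8 rows.
SELECT COUNT(*) AS IndependentRows
FROM @x.nodes('/MyXMLFile/Sample/SampleID') AS a(b)
CROSS APPLY @x.nodes('/MyXMLFile/Sample/SampleID/../Location') AS c(d);

-- Here the second nodes() call starts from each Sample row, so it is correlated:
-- 2 Location nodes per Sample = 4 rows.
SELECT COUNT(*) AS CorrelatedRows
FROM @x.nodes('/MyXMLFile/Sample') AS s(n)
CROSS APPLY s.n.nodes('Location') AS l(n);
```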

Then I thought I would try a different way.

Query 4

DECLARE @x xml;
SELECT @x = xData
FROM xTable;
SELECT a.b.value('(SampleID/text())[1]', 'varchar(20)') AS SampleID,
       a.b.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName,
       a.b.value('(Foo1/text())[1]', 'varchar(6)') AS Foo1,
       a.b.value('(Foo2/text())[1]', 'varchar(6)') AS Foo2

FROM @x.nodes('MyXMLFile/Sample') AS xData(a)
CROSS APPLY @x.nodes('MyXMLFile/Sample/SampleID/../Location') AS a(b)

Now, instead of a blank SampleID field, SampleID came back NULL, and the rows were still duplicated:

ID      SampleID    LocationName    Foo1   Foo2
1       NULL        Jeff            10     11     
2       NULL        Jenn            11     12     
…       NULL        …               …      …            
42      NULL        …               …      … 
43      NULL        …               …      …
44      NULL        …               …      …
45      NULL        Greg            13     14    
46      NULL        Anne            17     18
…       NULL        …               …      …
88      NULL        …               …      …

So in a final attempt to select the right data, I tried this query:

Query 5

DECLARE @x xml;
SELECT @x = xData
FROM xTable;
SELECT a.b.value('(SampleID/text())[1]', 'varchar(20)') AS SampleID,
       c.d.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName,
       c.d.value('(Foo1/text())[1]', 'varchar(6)') AS Foo1,
       c.d.value('(Foo2/text())[1]', 'varchar(6)') AS Foo2

FROM @x.nodes('MyXMLFile/Sample') AS xData(a)
CROSS APPLY @x.nodes('MyXMLFile/Sample') AS a(b)
CROSS APPLY @x.nodes('MyXMLFile/Sample/SampleID/../Location') AS c(d)

The result here is even more surprising to me: not only did the query populate all the fields, but it also quadrupled the output:

ID      SampleID    LocationName    Foo1   Foo2
1       00000001    Jeff            10     11     
2       00000001    Jenn            11     12     
…       00000001    …               …      …            
…       00000001    …               …      …     
…       00000001    …               …      …    
44      00000001    …               …      …
45      00000002    Greg            13     14   
46      00000002    Anne            17     18
47      00000002    …               …      …
48      00000002    …               …      …
…           …       …               …      …
176     00000002    …               …      …

I believe my problem lies in how I combine the two different XPath expressions in the query, and in my understanding and use of the derived tables. Any help would be appreciated. How can I adjust these queries to get the table I need?

Thanks in advance.

EDIT:
Following David Browne's answer, this works for me:

Query 6

INSERT INTO MyTable (SampleID, LocationName, Foo1, Foo2)
SELECT Sample.n.value('(SampleID)[1]', 'varchar(20)') AS SampleID,
       Location.n.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName,
       Location.n.value('(Foo1/text())[1]', 'varchar(6)') AS Foo1,
       Location.n.value('(Foo2/text())[1]', 'varchar(6)') AS Foo2
FROM xTable AS x
-- XPath is case-sensitive: the root element is MyXMLFile
CROSS APPLY x.xData.nodes('/MyXMLFile/Sample') AS Sample(n)
CROSS APPLY Sample.n.nodes('Location') AS Location(n)

Best Answer

The pattern is that each cross apply calls nodes() on the node returned by the previous one, so its path is relative to that parent node rather than to the document root. Try something like this:

declare @doc xml =N'<?xml version="1.0" encoding="UTF-16"?>
<MyXMLFile>
    <!--There CAN BE up to 1 million <Sample> nodes-->
    <Sample>
        <!--There ARE EXACTLY 22 child nodes for each <Sample> parent node, one <SampleID> and 21 <Location>-->
        <SampleID>0000001A</SampleID>
        <!--There ARE EXACTLY 3 child nodes for each <Location> parent node, one <LocationName> and two <Foo>-->
        <Location>
            <LocationName>Jeff</LocationName>
            <Foo1>10</Foo1>
            <Foo2>11</Foo2>
        </Location>
        <Location>
            <LocationName>Jenn</LocationName>
            <Foo1>11</Foo1>
            <Foo2>12</Foo2>
        </Location>
    </Sample>
    <Sample>
        <SampleID>0000002A</SampleID>
        <Location>
            <LocationName>Greg</LocationName>
            <Foo1>13</Foo1>
            <Foo2>14</Foo2>
        </Location>
        <Location>
            <LocationName>Anne</LocationName>
            <Foo1>14</Foo1>
            <Foo2>16</Foo2>
        </Location>
    </Sample>
</MyXMLFile>'

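-- DROP TABLE IF EXISTS requires SQL Server 2016 or later
drop table if exists #xData;

-- two rows holding the same document, to show that the query handles more than one row
with q as
(
    select 1 ID, @doc xData
    union all 
    select 2 ID, @doc xData
)
select *
into #xData
from q;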

SELECT Sample.n.value('(SampleID)[1]', 'varchar(20)') AS SampleID,
       Location.n.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName,
       Location.n.value('(Foo1/text())[1]', 'varchar(6)') AS Foo1,
       Location.n.value('(Foo2/text())[1]', 'varchar(6)') AS Foo2

FROM #xData x
cross apply x.xData.nodes('/MyXMLFile/Sample') AS Sample(n)
cross apply Sample.n.nodes('Location') as Location(n)

outputs

SampleID             LocationName Foo1   Foo2
-------------------- ------------ ------ ------
0000001A             Jeff         10     11
0000001A             Jenn         11     12
0000002A             Greg         13     14
0000002A             Anne         14     16
0000001A             Jeff         10     11
0000001A             Jenn         11     12
0000002A             Greg         13     14
0000002A             Anne         14     16

(8 rows affected)
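One variation that may be useful: CROSS APPLY drops any <Sample> that has no <Location> children. The question states that every sample has exactly 21 locations, so this should not arise there, but if a document could contain such a sample, OUTER APPLY keeps it with NULL location columns. The following is a sketch under that assumption; the sample 0000003A is hypothetical and not part of the original data.

```sql
-- Illustrative document; the second Sample (hypothetical) has no Location children.
DECLARE @doc xml = N'<MyXMLFile>
  <Sample>
    <SampleID>0000001A</SampleID>
    <Location><LocationName>Jeff</LocationName></Location>
  </Sample>
  <Sample>
    <SampleID>0000003A</SampleID>
  </Sample>
</MyXMLFile>';

SELECT Sample.n.value('(SampleID/text())[1]', 'varchar(20)') AS SampleID,
       Location.n.value('(LocationName/text())[1]', 'varchar(10)') AS LocationName
FROM @doc.nodes('/MyXMLFile/Sample') AS Sample(n)
-- OUTER APPLY keeps the Sample row even when nodes('Location') returns no rows;
-- LocationName is then NULL for that row.
OUTER APPLY Sample.n.nodes('Location') AS Location(n);
```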

Related Solutions

Sql-server – Item Index for all items in nested XML query

I found the answer - I needed to create a CTE that uses a union of all my child and parent records and creates a ROW_NUMBER(), then JOIN to that CTE to get the ROW_NUMBER() value which will be unique across all records.

Full solution to paste into SSMS:

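-- Drop each temp table only if it exists, so one missing table does not skip the other drop
IF OBJECT_ID('tempdb..#Child') IS NOT NULL DROP TABLE #Child;
IF OBJECT_ID('tempdb..#Parent') IS NOT NULL DROP TABLE #Parent;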

CREATE TABLE #Parent 
    (RecId int PRIMARY KEY NOT NULL, 
     PersonName varchar(100), Age int)
CREATE TABLE #Child 
    (ChildID int identity PRIMARY KEY NOT NULL,
     ParentRecId int FOREIGN KEY REFERENCES #Parent(RecId), 
     SalesAmt money)

INSERT INTO #Parent
  (RecID, PersonName, Age)
VALUES
  (1, 'Aaron Bertrand', 99),
  (2, 'Paul White', 20),
  (3, 'JNK', 33)

INSERT INTO #Child
  (ParentRecID, SalesAmt)
VALUES
  (1, 10.00),
  (1, 20.00),
  (2, 15.15),
  (2, 100.00),
  (3, 0.00)

;WITH IDs AS
(
    SELECT
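       -- ChildId as a final tiebreaker keeps the numbering deterministic among children of one parent
       RN = (ROW_NUMBER() OVER (ORDER BY RecId, CASE WHEN ChildId IS NULL THEN 0 ELSE 1 END, ChildId) - 1),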
       RecId,
       ChildId
    FROM
       (
       SELECT
          RecId, ChildId = NULL
       FROM
          #Parent
       UNION ALL
       SELECT
          RecId, ChildId
       FROM 
          #Parent P
       INNER JOIN
          #Child C
              ON C.ParentRecId = P.RecId) x
)

SELECT
  P.RecId as 'RID',
  P.PersonName as 'PNAM',
  P.Age,
  I.RN as 'Index',
  (
    SELECT 
      C.SalesAmt as 'SAMT',
      I.RN as 'Index'
    FROM
      #Child C
    INNER JOIN
       IDs I
          ON I.ChildId = C.ChildID
    WHERE
      C.ParentRecId = P.RecId
    FOR XML PATH ('ChildRec'), ROOT ('ChildRecs'), TYPE
  )
FROM
  #Parent P
INNER JOIN
    IDs I
       ON I.RecId = P.RecId
       AND I.ChildId IS NULL
FOR XML PATH ('Parent'), ROOT ('Parents'), TYPE

Sql-server – Query to search for a substring in xml

To know what performance you will have you have to test on your data. I obviously can't do that so I made up my own xml data to test the two queries you have in this question.

Create a table with 5000 rows containing an XML document of 9475 characters in 415 nodes:

create table T
(
  ID int identity primary key,
  XMLCol xml not null
)

declare @X xml = 
(
  select top 100 *
  from master..spt_values
  for xml path('row'), root('root'), type
)

insert into T(XMLCol)
select top(5000) @X
from master..spt_values as m1, master..spt_values as m2

Execute the queries to search for a value that is present in the first node (rpc) and another value that is present in the last node (SERVER ROLE).

select count(*)
from T
where charindex('rpc',cast(xmlcol as varchar(max))) > 0

select count(*)
from T
where XMLCol.exist('//*/text()[contains(., "rpc")]') = 1

select count(*)
from T
where charindex('SERVER ROLE',cast(xmlcol as varchar(max))) > 0

select count(*)
from T
where XMLCol.exist('//*/text()[contains(., "SERVER ROLE")]') = 1

The IO for the different queries is the same, so here is the output from using set statistics time on:

Search for rpc with charindex:

 SQL Server Execution Times:
   CPU time = 1435 ms,  elapsed time = 1434 ms.

Search for rpc with xml exist:

 SQL Server Execution Times:
   CPU time = 63 ms,  elapsed time = 68 ms.

Search for SERVER ROLE with charindex:

 SQL Server Execution Times:
   CPU time = 7316 ms,  elapsed time = 7321 ms.

Search for SERVER ROLE with xml exist:

 SQL Server Execution Times:
   CPU time = 3245 ms,  elapsed time = 3244 ms.

The clear winner in both cases is the XML query. It does a better job of scanning the entire XML, and it does a much better job of early termination when the search string is found.

This is true for the test data above using SQL Server 2012. It could be different for you with your data and your search strings. You have to test to know what is best for you.
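When repeating the test with your own search strings, it can be convenient to pass the string as a variable instead of editing the XQuery literal. The following is a minimal sketch using sql:variable(); the table variable @T, its two rows, and the variable @search are illustrative stand-ins, not the test table above.

```sql
-- Illustrative stand-in for table T with two tiny documents.
DECLARE @T table (ID int identity primary key, XMLCol xml not null);
INSERT INTO @T (XMLCol)
VALUES (N'<root><row><name>rpc</name></row></root>'),
       (N'<root><row><name>pub</name></row></root>');

-- Hypothetical search parameter.
DECLARE @search varchar(100) = 'rpc';

-- sql:variable() exposes the T-SQL variable to the XQuery expression,
-- so the same statement can be reused for different search strings.
SELECT COUNT(*) AS Matches
FROM @T
WHERE XMLCol.exist('//*/text()[contains(., sql:variable("@search"))]') = 1;
```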

NOTE: As stated in the answer to your other question, the two queries above will not return the same result, because the XML query only searches node values, whereas the charindex query searches the entire XML document, including node names and markup.
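The difference is easy to see with a search string that occurs only in the markup. The following is a small sketch, assuming a one-row table variable @T (illustrative, not the test table above): the string row appears only as an element name, so the charindex query counts the row while the XML query does not.

```sql
-- Illustrative one-row table; 'row' occurs only as an element name, never in a text node.
DECLARE @T table (ID int identity primary key, XMLCol xml not null);
INSERT INTO @T (XMLCol)
VALUES (N'<root><row><name>rpc</name></row></root>');

SELECT
    -- Searches the serialized document, markup included: counts the row.
    (SELECT COUNT(*) FROM @T
     WHERE CHARINDEX('row', CAST(XMLCol AS varchar(max))) > 0) AS CharindexMatches,
    -- Searches text nodes only: finds nothing.
    (SELECT COUNT(*) FROM @T
     WHERE XMLCol.exist('//*/text()[contains(., "row")]') = 1) AS XmlExistMatches;
```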
