Sql-server – How to shred .docx XML

sql serversql-server-2008xml

I am trying to import a xml (actually a docx file) to a sql server 2008 database. I am almost a novice in XML programming. I googled a lot but almost all the examples there are with simple xml file. Here the xml file is little complex(please see below). Can you please give me some idea how I should create the table for this XML and what query should I run in sql server. I need values for all the tags e.g. w:rsidP,w:rsidRDefault,w:rsidR of w:p,w:pStyle,w:bookmarkStart,w:t tags etc.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body>
<w:p w:rsidR="00EF42E0" w:rsidRDefault="00EF42E0" w:rsidP="00EF42E0">
<w:pPr><w:pStyle w:val="Heading1"/>
</w:pPr><w:bookmarkStart w:id="0" w:name="_Toc212523610"/>
<w:r>
<w:t>Summary</w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:p w:rsidR="00EF42E0" w:rsidRDefault="00EF42E0" w:rsidP="00EF42E0"><w:pPr><w:pStyle w:val="mainbodytext"/><w:ind w:right="-694"/><w:rPr><w:b/><w:bCs/></w:rPr></w:pPr><w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>What is the Group Defined Practice for Integrity Management?</w:t></w:r></w:p>
<w:p w:rsidR="00EF42E0" w:rsidRDefault="00EF42E0" w:rsidP="00EF42E0"><w:pPr><w:pStyle w:val="mainbodytext"/></w:pPr><w:r><w:t xml:space="preserve">This Practice is derived from the GP Group Standard, GRP 01 January 2006, </w:t></w:r><w:proofErr w:type="gramStart"/><w:r><w:t>Integrity</w:t></w:r><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> Management.  In developing QMS it has been possible to embed much of the content of the IM Standard directly into the Group Essentials statements.  For elements 2, 7, 8 and 9 of the Standard it was possible to do that in their entirety and therefore content of those elements are not repeated within this Practice.</w:t></w:r></w:p></w:body></w:document>

Best Answer

When working with XML in SQL Server you use the xml Data Type Methods and when shredding XML documents you typicly use the nodes() and value() methods. The XML you have here also include a number of namespaces so you have to specify the ones you need using WITH XMLNAMESPACES (Transact-SQL).

The XML is quite complex so without knowing how you want the data extracted I can only give you a couple of sample queries that you then can modify to whatever it is you need.

You have four w:p nodes and here is a query that fetches the attributes from those nodes. Using @ specifies that it is the value of an attribute you want-

with xmlnamespaces('http://schemas.openxmlformats.org/wordprocessingml/2006/main' as w)
select P.X.value('@w:rsidR', 'char(8)') as rsidR,
       P.X.value('@w:rsidRDefault', 'char(8)') as rsidRDefault,
       P.X.value('@w:rsidP', 'char(8)') as rsidP
from @doc.nodes('/w:document/w:body/w:p') as P(X);

SQL Fiddle

If you in addition to that want the text in the w:t node you need to do a cross apply to a second nodes() clause that will shred the XML within the w:p node.

with xmlnamespaces('http://schemas.openxmlformats.org/wordprocessingml/2006/main' as w)
select P.X.value('@w:rsidR', 'char(8)') as rsidR,
       P.X.value('@w:rsidRDefault', 'char(8)') as rsidRDefault,
       P.X.value('@w:rsidP', 'char(8)') as rsidP,
       T.X.value('text()[1]', 'nvarchar(max)') as Text
from @doc.nodes('/w:document/w:body/w:p') as P(X)
  cross apply P.X.nodes('w:r/w:t') as T(X);

SQL Fiddle

You said in your question that you want to get the values from all the tags. I don't know how useful this is to you but you can build a Name-Value list with all the attributes and elements in the XML.

This will give you all elements.

select T.X.value('local-name(.)', 'nvarchar(max)') Name,
       T.X.value('.', 'nvarchar(max)') Value
from @doc.nodes('//*') as T(X)

Change '//*' to '//@*' and you will get all attributes.

select T.X.value('local-name(.)', 'nvarchar(max)') Name,
       T.X.value('.', 'nvarchar(max)') Value
from @doc.nodes('//@*') as T(X)

And you can combine them in one query as well.

select T.X.value('local-name(.)', 'nvarchar(max)') Name,
       T.X.value('.', 'nvarchar(max)') Value
from @doc.nodes('//@*, //*') as T(X)

SQL Fiddle