Sql-server – Update badly formed XML to allow parsing

sql servert-sqlxml

I have inherited a production system that needs some fine tuning to enable easy extract of the result set. A record gets added to a table which contains an XML field, with contents similar to the below:

<PROG>PROGX_</PROG><STNTYPE>LEAKTEST</STNTYPE><STNID>S1</STNID><TRCODE>SN_REQUEST</TRCODE><OPERATORID>222</OPERATORID><PARTID>PARTID</PARTID><TRSTATUS>0000</TRSTATUS><TRTEXT>TEST OK</TRTEXT><T1_USL>0.45</T1_USL><T1_LSL>-0.45</T1_LSL><T1_RESULT>05,CC/M</T1_RESULT><T1_STATUS>OK</T1_STATUS><OVERALL_RESULT>1</OVERALL_RESULT><OVERALL_STATUS>OK</OVERALL_STATUS>

However, I believe the formatting of the XML is invalid and I want to be able to clean this up a bit, by adding a root and grouping this a bit better. Something similar to along the lines of:

<RESULTSET>
    <PROG>PROGX_</PROG>
    <STNTYPE>LEAKTEST</STNTYPE>
    <STNID>S1</STNID>
    ...etc...
    <OVERALL_STATUS>OK</OVERALL_STATUS>
</RESULTSET>

(The idea being then that I can use SQL views to then properly split this).

Any ideas if this is possible? I'm new to XML on SQL.

Best Answer

You can get this information on SQL Server without adding a root element:

create table tbl (foo xml);

insert into tbl values ('<PROG>PROGX_</PROG><STNTYPE>LEAKTEST</STNTYPE><STNID>S1</STNID><TRCODE>SN_REQUEST</TRCODE><OPERATORID>222</OPERATORID><PARTID>PARTID</PARTID><TRSTATUS>0000</TRSTATUS><TRTEXT>TEST OK</TRTEXT><T1_USL>0.45</T1_USL><T1_LSL>-0.45</T1_LSL><T1_RESULT>05,CC/M</T1_RESULT><T1_STATUS>OK</T1_STATUS><OVERALL_RESULT>1</OVERALL_RESULT><OVERALL_STATUS>OK</OVERALL_STATUS>
');

select 
    foo.value('(/PROG)[1]', 'nvarchar(100)') PROG,
    foo.value('(/STNTYPE)[1]', 'nvarchar(100)') STNTYPE,
    foo.value('(/STNID)[1]', 'nvarchar(100)') STNID
from   tbl;
GO

PROG   | STNTYPE  | STNID
:----- | :------- | :----
PROGX_ | LEAKTEST | S1

db<>fiddle here

Related Solutions

Parse Typed XML with Default Namespace and Inconsistent Root Node in SQL Server

The three problems with your XQuery that I can see are (and none of this has anything to do with it being Typed XML):

You are not specifying the correct path to the /root node. It should be:

SELECT t.c.value('(.)[1]','varchar(50)') as type
from @ixml.nodes('/WebService/NewLeads/OutputSchema/root/item/transaction') as t(c)

XML is case-sensitive so you need to use an upper-case "T" for the "Transaction" node:

SELECT t.c.value('(.)[1]','varchar(50)') as type
from @ixml.nodes('/WebService/NewLeads/OutputSchema/root/item/Transaction') as t(c)

While those fixes would probably get you 1 value back (in this case, it should be "SaleOnly"), it won't iterate through the "item" nodes because you were too specific in the path supplied to the .nodes() function. Instead, the final node of that specification should be "item", which is what you want to iterate through. And in that case, you move the "Transaction" part up to the .value() function:
```
SELECT t.c.value('(./Transaction)[1]','varchar(50)') as type
from @ixml.nodes('/WebService/NewLeads/OutputSchema/root/item') as t(c)
```

Regarding the statement of:

I work on xml both missing and containing second xmlns declaration in root element, as well as with structure trimmed down to just root.

The "structure trimmed down to just root" should be handleable by removing everything between the first / and the / right before /root... (i.e. the WebService/NewLeads/OutputSchema). So the resulting path would be:

from @ixml.nodes('//root/item') as t(c)

NOTE:
I am not able to get this to work 100% with the namespace declared in the <WebService> element (please see additional notes as this is no longer the case). Taking that out makes it work. Giving it a prefix, such as xmlns:something="http://www.orbis-software.com/WebSvcCon" makes it work, but then that needs to be declared in the methods. The only way I can get it to work right now is by declaring the default namespace in each XML function (.nodes and .value) as follows:

SELECT t.c.value('declare default element namespace "http://www.orbis-software.com/WebSvcCon";
                  (./Transaction)[1]','varchar(50)') as [type]
from @ixml.nodes('declare default element namespace "http://www.orbis-software.com/WebSvcCon";
                  /WebService/NewLeads/OutputSchema/root/item') as t(c)

NOTE 2:
Even better, you can use WITH XMLNAMESPACES to declare one or more namespaces to use for the entire query, so no need to define in each XML function. The following both work:

;WITH XMLNAMESPACES (DEFAULT 'http://www.orbis-software.com/WebSvcCon')
SELECT t.c.value('(./Transaction)[1]','varchar(50)') as [type]
from @ixml.nodes('/WebService/NewLeads/OutputSchema/root/item') as t(c)


;WITH XMLNAMESPACES (DEFAULT 'http://www.orbis-software.com/WebSvcCon')
SELECT t.c.value('(./Transaction)[1]','varchar(50)') as [type]
from @ixml.nodes('//root/item') as t(c)

However, just keep in mind that if the document is missing the <WebService xmlns="http://www.orbis-software.com/WebSvcCon"> element and hence has no default namespace, then you need to remove the ;WITH XMLNAMESPACES part. Of course, if the <root> element has its own default namespace, then maybe you will need to keep it. You can play around with it until it works, now that you know the connection between these pieces.

NOTE 3:
If you do end up having two default namespaces declared -- one in the <WebService> element and one in the <root> element -- then you need to specify the URI noted in <root xmlns="bob"> and the // syntax instead of the fully-qualified path. So if your XML looked like:

<WebService xmlns="http://www.orbis-software.com/WebSvcCon">
 <NewLeads>
  <OutputSchema>
  <root xmlns="http://someplace" type="array">

You would then use:

;WITH XMLNAMESPACES (DEFAULT 'http://someplace')
SELECT t.c.value('(./Transaction)[1]','varchar(50)') as [type]
from @ixml.nodes('//root/item') as t(c)

But that won't help if you do have the <WebService> element and yet the <root> element is missing the xmlns declaration. In that case you still need to specify the namespace noted in the <WebService> element. Fun, fun, fun :-).

NOTE 4:
Even more better: incorporating something mentioned in @wBob's answer, we can actually get rid of the ;WITH XMLNAMESPACES clause, and instead use a namespace wildcard. You just need to prefix every node of every XML function with *:. Now the query should look as follows:

SELECT t.c.value('(./*:Transaction)[1]','varchar(50)') AS [type],
       t.c.value('(./*:SaleProperty/*:PostDistrict)[1]','varchar(50)') AS [PostDistrict]
FROM  @ixml.nodes('//*:root/*:item') t(c);

Doing this means that the query works in all of your scenarios:

Full structure starting with "WebService" node, second xmlns declaration:

<WebService xmlns="http://www.orbis-software.com/WebSvcCon">
  <NewLeads>
    <OutputSchema>
      <root xmlns="uri" type="array">

Full structure starting with "WebService" node, single xmlns declaration:

<WebService xmlns="http://www.orbis-software.com/WebSvcCon">
  <NewLeads>
    <OutputSchema>
      <root type="array">

Trimmed down structure starting with "root" node, single xmlns declaration:
```
<root xmlns="uri" type="array">
```

Sql-server – XML parsing slowly

You could try OPENXML. Now I don't normally recommend OPENXML as it has a number of well-known memory issues (basically it can take 1/8th of your buffer pool depending on the size of XML)(!!TODO add link). However legend has it, it is faster for larger pieces of XML, so it's worth a try in a dev/test environment and if you know the memory issues, and you get the performance, it's up to you to decide which you need most. Something like this:

DECLARE @handle INT

EXEC sp_xml_preparedocument @handle OUTPUT, @RemarksXml

INSERT INTO #ChangeSet
SELECT RemarkTypeID, RemarkText, @ListingID
FROM OPENXML( @handle, '/Remarks[1]/Remark', 1 ) 
WITH ( 
    RemarkTypeID    TINYINT,
    RemarkText      VARCHAR(2500)
    )

EXEC sp_xml_removedocument @handle

Remember to always call sp_xml_removedocument. I am guessing your real XML is much larger, can you give us an idea how many Remark elements it will have and what the size is in KB / MB. I will come back to the post later and set up a test rig to compare performance based on your stats.

UPDATE: According to your example script, your XML can only have a max of 256 Remark elements with a max length of 2500. Having created some sample XML to meet these criteria and tested it, there is little in performance between the two techniques, and both finish in sub-second. In this case I would choose the XML datatype and methods. Can you provide the other information requested please?

Best Answer

Related Solutions

Parse Typed XML with Default Namespace and Inconsistent Root Node in SQL Server

Sql-server – XML parsing slowly

Related Question