Update 10/30/2018. Just in case anyone is interested, I have started to document at
the end of this post what I'm actually doing to solve this, just
because I've found some interesting things.
Disclaimer! I'm new to a lot of this. I've had fairly limited self taught exposure to SQL server, VBA, XPATH, XSD, XML and a year of formal training in logic (Java). My job has migrated me into a unique position a year or so ago to getting off and on exposure to SQL and XML so I'm still quite a novice. The things I post/ask may be very naïve and if so, I apologize and welcome frank, harsh, criticism, suggestions about function and appropriateness of form and good hard copy references (I don't mean the Microsoft docs because I rarely can figure heads or tails of them). So on to it!
I've been getting this error with a project I'm working on and although there are some posts about it here and elsewhere, nothing seems to be helping. My guess is I'm having an XPATH expression problem…initially.
XQuery [XMLTestTable.DATA.value()]: Cannot implicitly atomize or apply
'fn:data()' to complex content elements, found type 'xs:anyType'
within inferred type
'element({urn:MyFile-schema}:SUBUNITPRICE,xs:anyType) *'.
As you can probably guess this is an XML import/conversion to relational table project. I've found quite a bit on the net about this type of thing but very little explanation of what much of it means (although I have some educated guesses at most of it).
I'll start here. Schemas:
Some of the time people will import a schema first, sometimes they don't and just roll with it. The way I understand schemas, they are for 1: validation of the xml document to some sort of standard to make sure everything runs/imports/exports smoothly 2: potentially increase efficiency of importing/exporting/editing of a file. Although I find that validating the document itself adds time to the import (I assume it's simply extra steps), downstream queries maybe more efficient although I haven't gotten that far. Either way, I thought as a matter of course and practice it was a good idea. So here is my schema (this was hand transcribed so if you see something bad, please yell at me!). Also, I use a node called SUBUNITPRICE, please understand that this isn't actually a monetary unit and I've changed some of the node names to keep things a little more confidential. Just know that this node value is a text value that can contain numbers and symbols.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs= "http://www.w3.org/2001/XMLSchema" targetNamespace="urn:MyFile-schema" xmlns="urn:MyFile-schema" elementFormDefault="qualified">
<xs:element name="MyFile">
<xs:complexType>
<xs:sequence>
<xs:element name="FIELD1" type="xs:double"/>
<xs:element name="FIELD2" type="xs:string"/>
<xs:element name="FIELD3" type="xs:string"/>
<xs:element name="FIELD4" type="xs:string"/>
<xs:element name="FIELD5" type="xs:string"/>
<xs:element name="FIELD6" type="xs:dateTime"/>
<xs:element name="GROUP" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="GROUPID" type="xs:string"/>
<xs:element name="GROUPCATEGORY" type="xs:string"/>
<xs:element name="UNIT" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name=" UNITNAME" type="xs:string"/>
<xs:element name="REVIEWER" type="xs:string"/>
<xs:element name="DATEANDTIME" type="xs:dateTime"/>
<xs:element name="SUBUNIT" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name=" SUBUNITPRICE" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="MULTIENTRY" type="xs:string"/>
<xs:attribute name="PARTIALUNIT" type="xs:string"/>
<xs:attribute name="KIT" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="FULL" type="xs:string"/>
<xs:attribute name="VER" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="MyFile" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:schema>
My XML. Comments are in it after the GROUP, UNITNAME and SUBUNIT nodes for references to node length, structure and potential XML file size (potentially 100k GROUP nodes for a file of about 500-600 MB each). Ellipses (…) just mark that there is an iteration of that specific node and the number of iterations I hope you find implicit in the node values. It may be that several of these files have to be imported into a SQL Server db at any given time.
<?xml version="1.0" encoding="utf-8"?>
<MyFile xmlns="urn:MyFile-schema">
<FIELD1>FileName</ FIELD1>
<FIELD2>Foo</ FIELD2>
<FIELD3>Bar</ FIELD3>
<FIELD4>Upload</ FIELD4>
<FIELD5>UserName</ FIELD5>
<FIELD6>UploadTime</ FIELD6>
<GROUP VER="Yes" FULL="false"> ‘<---- There can be up to 100K of these (maybe more soon) ---identified by child node GROUPID
<GROUPID>GrName1</ GROUPID>
<GROUPCATEGORY>MyCategory</ GROUPCATEGORY>
<UNIT KIT=”1” PARTIALUNIT ="false" MULTIENTRY=”Yes”> ‘<---- 12 to 35 possible in each GROUP NODE ---identified by child UNITNAME
<UNITNAME>Unit1</ UNITNAME>
<REVIEWER>UserName</ REVIEWER>
<DATEANDTIME>DateTime</ DATEANDTIME>
<SUBUNIT> ‘<---- 2 in each UNIT NODE most of time – will be same tag SUBUNIT
<SUBUNITPRICE>11.50</ SUBUNITPRICE>
</ SUBUNIT>
< SUBUNIT> ‘<-sometimes 3 in intermittent unit nodes but rarely included in file
< SUBUNITPRICE >20.00</ SUBUNITPRICE>
</ SUBUNIT>
</ UNIT>
...
<UNIT KIT=”1” PARTIALUNIT ="false" MULTIENTRY=”Yes”>
<UNITNAME>Unit23</ UNITNAME>
<REVIEWER>UserName</ REVIEWER>
<DATEANDTIME>DateTime</ DATEANDTIME>
<SUBUNIT>
<SUBUNITPRICE>$11.50</ SUBUNITPRICE>
</ SUBUNIT>
< SUBUNIT>
<SUBUNITPRICE>$20.00</ SUBUNITPRICE>
</ SUBUNIT>
</ UNIT>
</GROUP>
<GROUP VER="Yes" FULL="false">
<GROUPID>GroupName100,000</ GROUPID>
...
</GROUP>
</MyFile>
Here's what I've been doing with it.
Schema import:
--Import the schema
IF EXISTS(SELECT * FROM sys.xml_schema_collections where [name] = 'XMLSchema')
DROP XML SCHEMA COLLECTION[XMLSchema]
DECLARE @MySchema XML
SET @MySchema =
(SELECT * FROM OPENROWSET
(BULK 'C:\Path\Schema.xsd', SINGLE_BLOB) AS xmlData
)
CREATE XML SCHEMA COLLECTION[XMLSchema] AS @MySchema
Load the file. This takes about 2 minutes for a 50k GROUP Node file, I have no idea how long it will take for 100k. I'd like to speed this up. I add an index to the xml column. I'm not quite sure about this part except that I picked it up as a bit of advice that it speeds things up a bit with queries downstream and I know it runs. I do know that I could shave some time and resources by editing out the unnecessary attributes (all of them) and fields but I've found that it just tacks the time on the front end doing the work of removing them.
I understand that if I add secondary indices to the XML column it could really speed things up. I have no idea
where to start with that. If anyone has got some suggestions for a
reference, or a quick addition, I'd appreciate the input.
CREATE TABLE XMLTestTable
(
ID INT IDENTITY PRIMARY KEY,
DATA xml(CONTENT MyXmlSchema)
)
INSERT INTO XMLTestTable
(DATA)
SELECT CONVERT(XML, BulkColumn) as BulkColumn
--import an xml file into the column
FROM OPENROWSET(BULK 'C:\Path\FileName.XML', SINGLE_BLOB) as x
CREATE PRIMARY XML INDEX PXML_DATA
ON XMLTestTable (DATA)
Next: I need to put the GROUPID and SUBUNITPRICEs for each GROUPID into a table, Ill call it GROUPTABLE and it hopefully look like this:
|ID |GROUPID|UNIT1_SUBUNITPRICE_1|UNIT1_SUBUNITPRICE_2|……|UNIT23_SUBUNITPRICE_2|
|1 |GrName1|11.50 |20.00 |……|25.00 |
|2 |GrName2|1.00 |32.41 |……|45.51 |
So I create the tables:
CREATE TABLE GROUPTABLE
(
ID int IDENTITY(1,1) PRIMARY KEY,
GROUPID varchar(20),
UNIT1_SUBUNITPRICE_1 varchar(7),
UNIT1_SUBUNITPRICE_2 varchar(7),
…
UNIT23_SUBUNITPRICE_1 varchar(7),
UNIT23_SUBUNITPRICE_2 varchar(7)
)
Now the part that generates the error! For demonstrative purposes, assume Ive only added the ID and GROUPID fields to the table here. If I start by just trying to insert the index and the groupid, it works great! Like this:
–-migrate the data from the xmlcolumn to the table
WITH XMLNAMESPACES(DEFAULT 'urn:MyFile-schema')
INSERT INTO GROUPTABLE
Select
t.b.value('GROUPID[1]', 'varchar(20)') AS GROUPID
FROM XMLTestTable
CROSS APPLY
DATA.nodes('//MyFile/GROUP) AS t(b)
…and then the problem. Here assume Ive only added the ID and UNIT1_SUBUNITPRICE_1 to the table. Let's try to add the subunitprices:
–-migrate the data from the xmlcolumn to the table
WITH XMLNAMESPACES(DEFAULT 'urn:MyFile-schema')
INSERT INTO GROUPTABLE
Select
t.b.value('UNIT[UNITNAME=“Unit1”]/../SUBUNIT[1]/SUBUNITPRICE[1]', 'varchar(7)') AS SUBUNITPRICE_1
FROM XMLTestTable
CROSS APPLY
DATA.nodes('//MyFile/GROUP) AS t(b)
Generating the error I mentioned way back at the beginning.
XQuery [XMLTestTable.DATA.value()]: Cannot implicitly atomize or apply
'fn:data()' to complex content elements, found type 'xs:anyType'
within inferred type
'element({urn:MyFile-schema}:SUBUNITPRICE,xs:anyType) *'.
If you have stuck with me through this long thing, I appreciate it! If you got this far here are my questions:
- What am I doing wrong with that xpath expression in the query?
- What can I do to speed up the import of the XML file into the column?
- What can I do to speed up the migration of the column into the table?
- Would it be better to make a bunch of tables for each UNIT instead of 1 table for all of them?
- Last, I hear rumors of people taking an XML object and parsing the file say 10k rows at a time and looping through them in a procedure
like this (import part of the file as an object variable, put it in
the table, repeat on multiple rows and then migrate at 10k rows at a time also). Is this
possible and would it help?
Update 10/30/2018
As far as importing xml files I tried several things.
-
I messed around with using an xml variable instead of dumping it into a table and I found that the variable loads just a little faster…saves maybe a second or two on the loading of the xml. I also compared loading straight into the table vs loading the xml into a variable then inserting the variable into the column. They took the same amount of time, or it was at least negligible. I have yet to compare whether querying the variable or the column is faster and whether or not indexing has an impact although everything I've read indicates that the indexing (particularly secondary indices with more nested xmls) decreases post load query time greatly. At the moment, my tests in this update did not involve a variable. I'm not quite sure where to go from here but I have some ideas and will update if and when I do.
-
Not validating to a schema is faster. If the file is already validated, or you can be confident in the data, this isn't really worth it. I removed this from further testing for now.
-
Removing the primary key on the xml column allows a much faster load into a column. Since there is so much overhead on storing the data as an xml and my goal is to convert it to RDBMS form, this also is probably not worth it but I am unsure yet. For now, I have removed this from further tests too.
-
I messed with loading multiple files of smaller numbers of the GROUP parent node and it made a big difference as far as loading the xml (I have yet to see how it affects query and conversion time). I created sets of xmls in incremental amounts up to 10,000 of the GROUP node and then 50,000 of the GROUP node, then wrote a dynamic sql query to bulk import them (I'll post the query after the results I found). The query puts the xml directly in a table/xml column.
The first test I ran was to compare a direct load into a table, a single xml file of 10,000 GROUP nodes to loading 10,000 files with one GROUP node each.
1 file of 10,000 GROUP nodes took 2 seconds to load into 1 row
10,000 files of 1 GROUP node took 19 seconds to load into 10,000 rows
The second test I ran was to compare varying amounts of 50,000 GROUP nodes.
10 files of 5,000 GROUP nodes took 13 seconds to load into 10 rows
5 files of 10,000 GROUP nodes took 6 seconds to load into 5 rows
1 file of 50,000 GROUP nodes took 12 seconds to load into 1 row
Clearly, 10k at a time is much better than anything else here, at
least for 50K GROUP Nodes. Maybe that would change if I was loading
say 100K nodes. Maybe I'll try that tomorrow and report back.
NOTE: I used the query timer in the bottom right hand corner so precision here is 1 second.
Here is the dynamic sql query I used to do this:
--variables for dynsql and loop
DECLARE @i int --I named the files with an integer so I could insert them into the loop
DECLARE @dsql varchar(Max)
--create xml table
CREATE TABLE xTable
(
ID INT IDENTITY PRIMARY KEY,
xData XML NOT NULL
);
--loop dynsql to import xml to xTable
Set @i = '1'
WHILE @i <=1 --I changed this depending on how many files I was loading
BEGIN
Set @dsql = 'INSERT INTO xTable(xData)
SELECT CONVERT(XML, BulkColumn) AS BulkColumn
FROM OPENROWSET(Bulk ' + Char(39) + 'C:\MyPath\' + CAST(@i as nvarchar(5)) + '.xml' + CHAR(39) + ', SINGLE_BLOB) x;'
Exec(@dsql)
SET @i = @i + 1
END;
This poses more questions for me that I'm not looking for answers for here but writing them for anyone following to ponder.
-
What is going to be the difference in xml to RDBMS conversion between 1 row of 50k GROUP Nodes and 5 rows of 10k GROUP Nodes?
-
Converting 1 xml of 50k GROUP Nodes to 5 xmls of 10k GROUP Nodes takes a while (I did it with Excel VBA…I know I could use something else but its what I have and know). Im guessing the most judicious way of going about it will probably do some XSLT work (this is something I need to learn)
-
In doing this it made me wonder if there was a quicker way to parse the xml into 10k node pieces via the sql server xml variable in sql server and would it be quicker than doing it outside of sql server (I don't know where to begin here).
For anyone still following or maybe if you happened on this thread
100 files with 1,000 GROUP nodes each – 25 seconds
40 files with 2,500 GROUP nodes each – 24 seconds
20 files with 5,000 GROUP nodes each – 24 seconds
10 files with 10,000 GROUP nodes each – 24 seconds
4 files with 25,000 GROUP nodes each – 20 seconds
2 files with 50,000 GROUP nodes each – 22 seconds
1 file with 100,000 GROUP nodes – 22 seconds
Best Answer
You should put the
UNITNAME
element in the predicate(UNIT[UNITNAME = "Unit1"]/SUBUNIT/SUBUNITPRICE)[1]
A working query would look something like this where
@X
has the entire XML:Nothing as far as I know. I would however load it to a variable instead of a table.
If you don't need to validate the XML you could skip the schema. It takes time to validate. Some says that using a schema will speed up shredding the XML and that would be true if you tried the query above with and without schema. If you don't use a schema you need to rewrite the query to specify the
text()
node and then you have the same shredding performance without the schema as without.It looks to me like you will eventually end up with a query that returns almost 50 columns. In the current implementation of querying the XML datatype in SQL Server that will be slow. More info here and vote for a change here.
The alternative to use OPENXML instead would look like this for you.
Note: You should not use the
text()
node in OPENXML, it will be slower.