Sql-server – Very strange performance with an XML index

performancesql serversql-server-2012xml

My question is based on this: https://stackoverflow.com/q/35575990/5089204

To give an answer there I did the following test-scenario.

Test scenario

First I create a test table and fill it with 100.000 rows. A random number (0 to 1000) should lead to ~100 rows for each random number. This number is put into a varchar col and as a value into your XML.

Then I do a call like the OP there needs it with .exist() and with .nodes() with a small advantage for the second, but both take 5 to 6 seconds. In fact I do the calls twice: a second time in swapped order and with slightly changed search params and with "//item" instead of the full path to avoid false positives via cached results or plans.

Then I create an XML index and do the same calls

Now – what really did surprise me! – the .nodes with full path is much slower than before (9 secs) but the .exist() is down to half a second, with full path even down to about 0.10 sec. (while .nodes() with short path is better, but still far behind .exist())

Questions:

My own tests bring up in short: XML indexes can blow up a database extremely. They can speed up things extremely (s. edit 2), but can slow your queries down too. I'd like to understand how they work… When should one create an XML index? Why can .nodes() with an index be worse than without? How could one avoid the negativ impact?

CREATE TABLE #testTbl(ID INT IDENTITY PRIMARY KEY, SomeData VARCHAR(100),XmlColumn XML);
GO

DECLARE @RndNumber VARCHAR(100)=(SELECT CAST(CAST(RAND()*1000 AS INT) AS VARCHAR(100)));

INSERT INTO #testTbl VALUES('Data_' + @RndNumber,
'<error application="application" host="host" type="exception" message="message" >
  <serverVariables>
    <item name="name1">
      <value string="text" />
    </item>
    <item name="name2">
      <value string="text2" />
    </item>
    <item name="name3">
      <value string="text3" />
    </item>
    <item name="name4">
      <value string="text4" />
    </item>
    <item name="name5">
      <value string="My test ' +  @RndNumber + '" />
    </item>
    <item name="name6">
      <value string="text6" />
    </item>
    <item name="name7">
      <value string="text7" />
    </item>
  </serverVariables>
</error>');

GO 100000

DECLARE @d DATETIME=GETDATE()
SELECT #testTbl.*
FROM #testTbl
CROSS APPLY XmlColumn.nodes('/error/serverVariables/item[@name="name5" and value/@string="My test 600"]') AS a(b);
SELECT CAST(GETDATE()-@d AS TIME) AS NodesFullPath_no_index;
GO

DECLARE @d DATETIME=GETDATE();
SELECT * 
FROM #testTbl
WHERE XmlColumn.exist('/error/serverVariables/item[@name="name5" and value/@string="My test 600"]') = 1;
SELECT CAST(GETDATE()-@d AS TIME) AS ExistFullPath_no_index;
GO

DECLARE @d DATETIME=GETDATE();
SELECT * 
FROM #testTbl
WHERE XmlColumn.exist('//item[@name="name5" and value/@string="My test 500"]') = 1;
SELECT CAST(GETDATE()-@d AS TIME) AS ExistShortPath_no_index;
GO

DECLARE @d DATETIME=GETDATE()
SELECT #testTbl.*
FROM #testTbl
CROSS APPLY XmlColumn.nodes('//item[@name="name5" and value/@string="My test 500"]') AS a(b);
SELECT CAST(GETDATE()-@d AS TIME) AS NodesShortPath_no_index;
GO

CREATE PRIMARY XML INDEX PXML_test_XmlColum1 ON #testTbl(XmlColumn);
CREATE XML INDEX IXML_test_XmlColumn2 ON #testTbl(XmlColumn) USING XML INDEX PXML_test_XmlColum1 FOR PATH;
GO

DECLARE @d DATETIME=GETDATE()
SELECT #testTbl.*
FROM #testTbl
CROSS APPLY XmlColumn.nodes('/error/serverVariables/item[@name="name5" and value/@string="My test 600"]') AS a(b);
SELECT CAST(GETDATE()-@d AS TIME) AS NodesFullPath_with_index;
GO

DECLARE @d DATETIME=GETDATE();
SELECT * 
FROM #testTbl
WHERE XmlColumn.exist('/error/serverVariables/item[@name="name5" and value/@string="My test 600"]') = 1;
SELECT CAST(GETDATE()-@d AS TIME) AS ExistFullPath_with_index;
GO

DECLARE @d DATETIME=GETDATE();
SELECT * 
FROM #testTbl
WHERE XmlColumn.exist('//item[@name="name5" and value/@string="My test 500"]') = 1;
SELECT CAST(GETDATE()-@d AS TIME) AS ExistShortPath_with_index;
GO

DECLARE @d DATETIME=GETDATE()
SELECT #testTbl.*
FROM #testTbl
CROSS APPLY XmlColumn.nodes('//item[@name="name5" and value/@string="My test 500"]') AS a(b);
SELECT CAST(GETDATE()-@d AS TIME) AS NodesShortPath_with_index;
GO

DROP TABLE #testTbl;

EDIT 1 – Results

This is one result with SQL Server 2012 locally installed on a medium laptop
In this test I could not reproduce the extremly negativ impact on NodesFullPath_with_index, although it is slower than without the index…

NodesFullPath_no_index    6.067
ExistFullPath_no_index    6.223
ExistShortPath_no_index   8.373
NodesShortPath_no_index   6.733

NodesFullPath_with_index  7.247
ExistFullPath_with_index  0.217
ExistShortPath_with_index 0.500
NodesShortPath_with_index 2.410

EDIT 2 Test with bigger XML

According to TT's suggestion I used the XML above, but copied the item-nodes to reach about 450 items. I let the hit-node be very high up in the XML ('cause I think that .exist() would stop on the first hit, while .nodes() would continue)

Creating the XML-index blew up the mdf-file to ~21GB, ~18GB seem to belong to the index (!!!)

NodesFullPath_no_index    3min44
ExistFullPath_no_index    3min39
ExistShortPath_no_index   3min49
NodesShortPath_no_index   4min00

NodesFullPath_with_index  8min20
ExistFullPath_with_index  8,5 seconds !!!
ExistShortPath_with_index 1min21
NodesShortPath_with_index 13min41 !!!

Best Answer

There sure is a lot going on here so we will just have to see where this leads.

First off, the difference in timing between SQL Server 2012 and SQL Server 2014 is due to the new cardinality estimator in SQL Server 2014. You can use a trace flag in SQL Server 2014 to force the old estimator and then you will see the same timing characteristics in SQL Server 2014 as in SQL Server 2012.

Comparing nodes() vs exist() is not fair since they will not return the same result if there are more than one matched element in the XML for one row. exist() will return one row from the base table regardless, whereas nodes() can potentially give you more than one row returned for each row in the base table.
We know the data but SQL Server does not and has to build a query plan that takes that into consideration.

To make the nodes() query equivalent to the exist() query, you could do something like this.

SELECT testTbl.*
FROM testTbl
WHERE EXISTS (
             SELECT *
             FROM XmlColumn.nodes('/error/serverVariables/item[@name="name5" and value/@string="My test 600"]') AS a(b)
             )

With a query like that there is no difference between using nodes() or exist() and that is because SQL Server builds almost the same plan for the two versions not using an index and exactly the same plan when index is used. That is true both for SQL Server 2012 and SQL Server 2014.

For me in SQL Server 2012 the queries without the XML index take 6 seconds using the modified version of the nodes() query above. There is no difference between using the full path or the short path. With the XML index in place the full path version is the fastest and takes 5 ms and using the short path takes about 500 ms. Examining the query plans will tell you why there is a difference but the short version is that when you use a short path, SQL Server seeks in the index on the short path (a range seek using like) and returns 700000 rows before discarding the rows that do not match on the value. When using the full path, SQL Server can use the path expression directly together with the value of the node to do the seek and returns only 105 rows from scratch to work on.

Using SQL Server 2014 and the new cardinalty estimator, there is no difference in these queries when using an XML index. Without using the index the queries still take the same amount of time but it is 15 seconds. Clearly not an improvement here when using new stuff.

Not sure if I completely lost track of what your question is actually about since I modified the queries to be equivalent but here is what I believe it is now.

Why is the nodes() query (original version) with an XML index in place significantly slower then when an index is not used?

Well, the answer is that SQL Server query plan optimizer does something bad and that is introducing a spool operator. I don't know why but the good news is that it is not there anymore with the new cardinalty estimator in SQL Server 2014.
With no indexes in place the query takes about 7 seconds no matter what cardinality estimator is used. With the index it takes 15 seconds with the old estimator (SQL Server 2012) and about 2 seconds with the new estimator (SQL Server 2014).

Note: The findings above are valid with your test data. There can be a whole different story to tell if you change the size, shape or form of the XML. No way to know for sure without testing with the data you actually have in the tables.

How the XML indexes work

XML indexes in SQL Server are implemented as internal tables. The primary XML index creates the table with the primary key of the base table plus node id column, in total 12 columns. It will have one row per element/node/attribute etc. so that table can of course get really big depending on the size of the XML stored. With a primary XML index in place SQL Server can use the primary key of the internal table to locate XML nodes and values for each row in the base table.

Secondary XML indexes come in three types. When you create a secondary XML index, there is a non-clustered index created on the internal table and, depending on what type of secondary index you create, it will have different columns and column orders.

From CREATE XML INDEX (Transact-SQL):

VALUE
Creates a secondary XML index on columns where key columns are (node value and path) of the primary XML index.

PATH
Creates a secondary XML index on columns built on path values and node values in the primary XML index. In the PATH secondary index, the path and node values are key columns that allow efficient seeks when searching for paths.

PROPERTY
Creates a secondary XML index on columns (PK, path and node value) of the primary XML index where PK is the primary key of the base table.

So when you create a PATH index, the first column in that index is the path expression and the second column is the value in that node. Actually, the path is stored in a kind of compressed format and reversed. That it is stored reversed is what makes it useful in searches using short path expressions. In your short path case you searched for //item/value/@string, //item/@name and //item. Since the path is stored reversed in the column, SQL Server can use a range seek with like = '€€€€€€% where €€€€€€ is the path reversed. When you use a full path, there is no reason to use like since the entire path is encoded in the column and the value can also be used in the seek predicate.

Your questions:

When should one create an XML index?

As a last resort if ever. Better to design your database so you don't have to use values inside XML to filter on in a where clause. If you know beforehand that you need to do that, you could use property promotion to create a computed column that you can index if needed. Since SQL Server 2012 SP1, you also have selective XML indexes available. The workings behind the scene are pretty much the same as with regular XML indexes, only you specify the path expression in the index definition and only the nodes that match are indexed. That way you can save a lot of space.

Why can .nodes() with an index be worse than without?

When there is an XML index created on a table, SQL Server will always use that index (the internal tables) to get the data. That decision is done before the optimizer has a say in what is fast and what is not fast. The input to the optimizer is rewritten so it is using the internal tables and after that it is up to the optimizer to do its best as with a regular query. When no index is used, there are a couple of table-valued functions that are used instead. The bottom line is that you can't tell what will be faster without testing.

How could one avoid the negative impact?

Testing