Sql-server – SQL Server 2008R2 XML Column Index strategy for large tables

indexindex-tuningsql serversql-server-2008-r2xml

I am working with SQL Server 2008 R2 and some XML columns in a very large table. I understand that I cannot use Selective XML Indexes due to our SQL Server version being < 2012. I'm brand spanking new to the idea of indexing XML columns.

The Goals

I have two separate scenarios in which I need to query XML:

  1. In one case, I need to query if a nested value exists across a table with 3 million rows
    • E.g. columnName.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1'
  2. In another case, I need to extract two values from XML into two columns so that I can match on it, in a table with 4.7 million rows.
    • E.g. SELECT columnName.value('Parent[1]/FilePath[1]', 'nvarchar(max)') as FilePath, columnName.value('Parent[1]/FileName[1]', 'nvarchar(255)') as FileName

Question

  • Given the cost of the indexes here and the fixed type of querying, what are indexes I can apply for maximum impact, and are there any pitfalls I should be aware of?

Best Answer

There a number of approaches to tuning XML queries in SQL Server. Property promotion is a good one, but I also regularly use the following:

XML Indexes

XML Indexes can transform XML query performance, but at a cost. Pre-SQL Server 2012, they come in two types, primary XML indexes and secondary XML indexes. You always need a primary XML index, and can optionally add PATH, PROPERTY or VALUE indexes which serve slightly different purposes. For your particular queries, a secondary PATH index gives step-change performance improvement in my simple rig below, eg:

CREATE PRIMARY XML INDEX xmlidx_largeTable ON dbo.largeTable ( yourXML )
GO

CREATE XML INDEX xpthidx_largeTable ON dbo.largeTable ( yourXML )
USING XML INDEX xmlidx_largeTable FOR PATH
GO

Now to the cost. XML Indexes (prior to Selective XML Indexes) have a huge storage impact. I have seen tables grow up to 5x in size. In my test rig below, the table with 3 million rows and very simple XML goes from 0.7GB to 2GB with primary XML index, then 2.7GB with the PATH secondary index. Selective XML indexes in SQL Server 2012 onwards can improve on this massively.

Best practice syntax

I use CROSS APPLY when there are multiple levels of XML to drill down left to right. See the use of CROSS APPLY in my rig below. Also, avoid use of the parent axis (..) to drill back up. This can cause performance problems especially with larger pieces of XML as per here.

I also always use the text() accessor with untyped XML, eg

SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)

This is mentioned here and I've seen this technique give up to 15% performance improvement. YMMV. Moving the ordinal ([1]) to the end of the expression is more efficient and syntactically equivalent to Parent[1]/FilePath[1]/SomeOtherElement[1].

XML Schema Collection

These don't tend to bring performance improvement, but are a good practice, as like a constraint, they force the XML to have a certain structure.

Full-text Indexing

I occasionally combine Full-text indexing with XML with good results, eg here. It's probably not appropriate in this example as you don't seem to have any criteria.

Test Rig

In my simple test rig, I create a simple table with 3 million rows and a simple piece of XML in each row. I then try different combinations of syntax and XML indexes to see the difference:

USE tempdb
GO

SET NOCOUNT ON
GO

------------------------------------------------------------------------------------------------
-- Setup START
------------------------------------------------------------------------------------------------

-- Create a large table 
IF OBJECT_ID('dbo.largeTable') IS NOT NULL DROP TABLE dbo.largeTable
CREATE TABLE dbo.largeTable ( 
    rowId INT IDENTITY PRIMARY KEY, 
    someData UNIQUEIDENTIFIER DEFAULT NEWID(), 
    dateAdded DATETIME DEFAULT GETDATE(), 
    addedBy VARCHAR(30) DEFAULT SUSER_NAME(),
    yourXML XML,
    ts ROWVERSION
    )
GO


-- Add 3 million rows to the table; with simple piece of XML in each row
;WITH cte AS (
SELECT TOP 1000000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO dbo.largeTable ( someData, yourXML )
SELECT NEWID(),
    (
    SELECT
        rn AS Child,
        'DifferentChild' + CAST( CASE WHEN rn % 9999 = 0 THEN rn % 33 ELSE NULL END AS VARCHAR(10) ) AS "Child/DifferentChild",
        'FilePath' + CAST( rn AS VARCHAR(10) ) AS FilePath,
        'FileName' + CAST( rn AS VARCHAR(10) ) AS [FileName]
    FOR XML PATH('Parent'), TYPE
    )
FROM cte
GO 3

-- Setup END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------


CREATE PRIMARY XML INDEX xmlidx_largeTable ON dbo.largeTable ( yourXML )
GO


------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------


CREATE XML INDEX xpthidx_largeTable ON dbo.largeTable ( yourXML )
USING XML INDEX xmlidx_largeTable FOR PATH
GO



------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------

My results:

enter image description here

So in summary, hopefully you can see, you can get step-change performance with your XML queries using the right features in combination, but with a hefty storage cost.

Recommended Reading

Performance Optimizations for the XML Data Type in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345118.aspx

XML Indexes in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345121(SQL.90).aspx