Sql-server – SQL Server 2008R2 XML Column Index strategy for large tables

indexindex-tuningsql serversql-server-2008-r2xml

I am working with SQL Server 2008 R2 and some XML columns in a very large table. I understand that I cannot use Selective XML Indexes due to our SQL Server version being < 2012. I'm brand spanking new to the idea of indexing XML columns.

The Goals

I have two separate scenarios in which I need to query XML:

In one case, I need to query if a nested value exists across a table with 3 million rows
- E.g. columnName.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1'
In another case, I need to extract two values from XML into two columns so that I can match on it, in a table with 4.7 million rows.
- E.g. SELECT columnName.value('Parent[1]/FilePath[1]', 'nvarchar(max)') as FilePath, columnName.value('Parent[1]/FileName[1]', 'nvarchar(255)') as FileName

Question

Given the cost of the indexes here and the fixed type of querying, what are indexes I can apply for maximum impact, and are there any pitfalls I should be aware of?

Best Answer

There a number of approaches to tuning XML queries in SQL Server. Property promotion is a good one, but I also regularly use the following:

XML Indexes

XML Indexes can transform XML query performance, but at a cost. Pre-SQL Server 2012, they come in two types, primary XML indexes and secondary XML indexes. You always need a primary XML index, and can optionally add PATH, PROPERTY or VALUE indexes which serve slightly different purposes. For your particular queries, a secondary PATH index gives step-change performance improvement in my simple rig below, eg:

CREATE PRIMARY XML INDEX xmlidx_largeTable ON dbo.largeTable ( yourXML )
GO

CREATE XML INDEX xpthidx_largeTable ON dbo.largeTable ( yourXML )
USING XML INDEX xmlidx_largeTable FOR PATH
GO

Now to the cost. XML Indexes (prior to Selective XML Indexes) have a huge storage impact. I have seen tables grow up to 5x in size. In my test rig below, the table with 3 million rows and very simple XML goes from 0.7GB to 2GB with primary XML index, then 2.7GB with the PATH secondary index. Selective XML indexes in SQL Server 2012 onwards can improve on this massively.

Best practice syntax

I use CROSS APPLY when there are multiple levels of XML to drill down left to right. See the use of CROSS APPLY in my rig below. Also, avoid use of the parent axis (..) to drill back up. This can cause performance problems especially with larger pieces of XML as per here.

I also always use the text() accessor with untyped XML, eg

SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)

This is mentioned here and I've seen this technique give up to 15% performance improvement. YMMV. Moving the ordinal ([1]) to the end of the expression is more efficient and syntactically equivalent to Parent[1]/FilePath[1]/SomeOtherElement[1].

XML Schema Collection

These don't tend to bring performance improvement, but are a good practice, as like a constraint, they force the XML to have a certain structure.

Full-text Indexing

I occasionally combine Full-text indexing with XML with good results, eg here. It's probably not appropriate in this example as you don't seem to have any criteria.

Test Rig

In my simple test rig, I create a simple table with 3 million rows and a simple piece of XML in each row. I then try different combinations of syntax and XML indexes to see the difference:

USE tempdb
GO

SET NOCOUNT ON
GO

------------------------------------------------------------------------------------------------
-- Setup START
------------------------------------------------------------------------------------------------

-- Create a large table 
IF OBJECT_ID('dbo.largeTable') IS NOT NULL DROP TABLE dbo.largeTable
CREATE TABLE dbo.largeTable ( 
    rowId INT IDENTITY PRIMARY KEY, 
    someData UNIQUEIDENTIFIER DEFAULT NEWID(), 
    dateAdded DATETIME DEFAULT GETDATE(), 
    addedBy VARCHAR(30) DEFAULT SUSER_NAME(),
    yourXML XML,
    ts ROWVERSION
    )
GO


-- Add 3 million rows to the table; with simple piece of XML in each row
;WITH cte AS (
SELECT TOP 1000000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO dbo.largeTable ( someData, yourXML )
SELECT NEWID(),
    (
    SELECT
        rn AS Child,
        'DifferentChild' + CAST( CASE WHEN rn % 9999 = 0 THEN rn % 33 ELSE NULL END AS VARCHAR(10) ) AS "Child/DifferentChild",
        'FilePath' + CAST( rn AS VARCHAR(10) ) AS FilePath,
        'FileName' + CAST( rn AS VARCHAR(10) ) AS [FileName]
    FOR XML PATH('Parent'), TYPE
    )
FROM cte
GO 3

-- Setup END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------


CREATE PRIMARY XML INDEX xmlidx_largeTable ON dbo.largeTable ( yourXML )
GO


------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------


CREATE XML INDEX xpthidx_largeTable ON dbo.largeTable ( yourXML )
USING XML INDEX xmlidx_largeTable FOR PATH
GO



------------------------------------------------------------------------------------------------
-- Queries START
------------------------------------------------------------------------------------------------

-- Query 1:
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('Parent[1]/Child[1]/DifferentChild[1]') = 1

-- Improve query
SELECT *
FROM dbo.largeTable t
WHERE t.yourXML.exist('(Parent/Child/DifferentChild)[1]') = 1
GO


-- Query 2:
SELECT
    yourXML.value('Parent[1]/FilePath[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('Parent[1]/FileName[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    yourXML.value('(Parent/FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    yourXML.value('(Parent/FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t


SELECT
    p.c.value('(FilePath/text())[1]', 'nvarchar(max)') AS FilePath,
    p.c.value('(FileName/text())[1]', 'nvarchar(255)') AS [FileName]
FROM dbo.largeTable t
    CROSS APPLY t.yourXML.nodes('Parent') p(c)
GO

-- Queries END
------------------------------------------------------------------------------------------------

My results:

enter image description here

So in summary, hopefully you can see, you can get step-change performance with your XML queries using the right features in combination, but with a hefty storage cost.

Recommended Reading

Performance Optimizations for the XML Data Type in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345118.aspx

XML Indexes in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345121(SQL.90).aspx

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms

Postgresql – Guidelines on best indexing strategy for varying searches on 20+ columns

Capture the queries that people actually make.
Analyze data from #1; then devise short, multi-column, indexes (2, maybe 3, columns each)
No. Especially not for yes/no flags. It's ok to combine such, as in #2.
See #3.
The order of columns is important in an index. A 5-column index is (roughly) equivalent to the 1-col index on its first column, but is not useful if you don't query on that column.
Probably.
Forcing into memory is counter-productive because it takes away space for caching other stuff.

More discussion: http://mysql.rjweb.org/doc.php/eav