Sql-server – Why is the secondary selective index not used when the where clause filters on `value()`

execution-planperformancequery-performancesql serversql-server-2012xml

Setup:

create table dbo.T
(
  ID int identity primary key,
  XMLDoc xml not null
);

insert into dbo.T(XMLDoc)
select (
       select N.Number
       for xml path(''), type
       )
from (
     select top(10000) row_number() over(order by (select null)) as Number
     from sys.columns as c1, sys.columns as c2
     ) as N;

Sample XML for each row:

<Number>314</Number>

The job for the query is to count the number of rows in T with a specified value of <Number>.

There are two obvious ways to do this:

select count(*)
from dbo.T as T
where T.XMLDoc.value('/Number[1]', 'int') = 314;

select count(*)
from dbo.T as T
where T.XMLDoc.exist('/Number[. eq 314]') = 1;

It turns out that value() and exists() requires two different path definitions for the selective XML index to work.

create selective xml index SIX_T on dbo.T(XMLDoc) for
(
  pathSQL = '/Number' as sql int singleton,
  pathXQUERY = '/Number' as xquery 'xs:double' singleton
);

The sql version is for value() and the xquery version is for exist().

You might think that an index like that would give you a plan with a nice seek but selective XML indexes are implemented as a system table with the primary key of T as the lead key of the clustered key of the system table. The paths specified are sparse columns in that table. If you want an index of the actual values of the defined paths you need to create a secondary selective indexes, one for each path expression.

create xml index SIX_T_pathSQL on dbo.T(XMLDoc)
  using xml index SIX_T for (pathSQL);

create xml index SIX_T_pathXQUERY on dbo.T(XMLDoc)
  using xml index SIX_T for (pathXQUERY);

The query plan for the exist() does a seek in the secondary XML index followed by a key lookup in the system table for the selective XML index (don't know why that is needed) and finaly it does a lookup in T to make sure there actually are rows in there. The last part is necessary because there is no foreign key constraint between the system table and T.

enter image description here

The plan for the value() query is not so nice. It does a clustered index scan of T with a nested loops join against a seek on the internal table to get the value from the sparse column and finally filters on the value.

enter image description here

If a selective index should be used or not is decided before optimization but if a secondary selective index should be used or not is a cost based decision by the optimizer.

Why is the secondary selective index not used when the where clause filters onvalue()?

Update:

The queries are semantically different. If you add a row with the value

<Number>313</Number>
<Number>314</Number>` 

the exist() version would count 2 rows and the values() query would count 1 row. But with the index definitions as they are specified here using the singleton directive SQL Server will prevent you from adding a row with multiple <Number> elements.

That does however not let us use the values() function without specifying [1] to guarantee the compiler that we will only get a single value. That [1] is the reason we have a Top N Sort in the value() plan.

Looks like I am closing in on an answer here…

Best Answer

The declaration of singleton in the path expression of the index enforces that you can not add multiple <Number> elements but the XQuery compiler does not take that into consideration when interpreting the expression in the value() function. You have to specify [1] to make SQL Server happy. Using typed XML with a schema does not help with that either. And because of that SQL Server builds a query that uses something that could be called an "apply" pattern.

Easiest to demonstrate is to use regular tables instead of XML simulating the query we are actually executing against T and the internal table.

Here is the setup for the internal table as a real table.

create table dbo.xml_sxi_table
(
  pk1 int not null,
  row_id int,
  path_1_id varbinary(900),
  pathSQL_1_sql_value int,
  pathXQUERY_2_value float
);

go

create clustered index SIX_T on xml_sxi_table(pk1, row_id);
create nonclustered index SIX_pathSQL on xml_sxi_table(pathSQL_1_sql_value) where path_1_id is not null;
create nonclustered index SIX_T_pathXQUERY on xml_sxi_table(pathXQUERY_2_value) where path_1_id is not null;

go

insert into dbo.xml_sxi_table(pk1, row_id, path_1_id, pathSQL_1_sql_value, pathXQUERY_2_value)
select T.ID, 1, T.ID, T.ID, T.ID
from dbo.T;

With both tables in place you can execute the equivalent of the exist() query.

select count(*)
from dbo.T
where exists (
             select *
             from dbo.xml_sxi_table as S
             where S.pk1 = T.ID and
                   S.pathXQUERY_2_value = 314 and
                   S.path_1_id is not null
             );

The equivalent of the value() query would look like this.

select count(*)
from dbo.T
where (
      select top(1) S.pathSQL_1_sql_value
      from dbo.xml_sxi_table as S
      where S.pk1 = T.ID and
            S.path_1_id is not null
      order by S.path_1_id
      ) = 314;

The top(1) and order by S.path_1_id is the culprit and it is [1] in the Xpath expression that is to blame.

I don't think it is possible for Microsoft to fix this with the current structure of the internal table even if you were allowed to leave out the [1] from the values() function. They would probably have to create multiple internal tables for each path expression with unique constraints in place to guarantee for the optimizer that there can only be one <number> element for each row. Not sure that would actually be enough for the optimizer to "break out of the apply pattern".

For you who think this fun and interesting and since you are still reading this you probably are.

Some queries to look at the structure of the internal table.

select T.name, 
       T.internal_type_desc, 
       object_name(T.parent_id) as parent_table_name
from sys.internal_tables as T
where T.parent_id = object_id('T');

select C.name as column_name, 
       C.column_id,
       T.name as type_name,
       C.max_length,
       C.is_sparse,
       C.is_nullable
from sys.columns as C
  inner join sys.types as T
    on C.user_type_id = T.user_type_id
where C.object_id in (
                     select T.object_id 
                     from sys.internal_tables as T 
                     where T.parent_id = object_id('T')
                     )
order by C.column_id;

select I.name as index_name,
       I.type_desc,
       I.is_unique,
       I.filter_definition,
       IC.key_ordinal,
       C.name as column_name, 
       C.column_id,
       T.name as type_name,
       C.max_length,
       I.is_unique,
       I.is_unique_constraint
from sys.indexes as I
  inner join sys.index_columns as IC
    on I.object_id = IC.object_id and
       I.index_id = IC.index_id
  inner join sys.columns as C
    on IC.column_id = C.column_id and
       IC.object_id = C.object_id
  inner join sys.types as T
    on C.user_type_id = T.user_type_id
where I.object_id in (
                     select T.object_id 
                     from sys.internal_tables as T 
                     where T.parent_id = object_id('T')
                     );