SQL Server – Query runs slowly when a non-indexed column is added to the WHERE clause

Tags: azure-sql-database, sql-server

My SQL Server houses a logs table that has 4 columns:

  • RowID – GUID PRIMARY KEY
  • OperationID – GUID (NONUNIQUE INDEXED)
  • Line – INT
  • Message – VARCHAR(250)

Some table facts:

  • An operation may have anywhere between 1 and 380,000 messages
  • OperationID is the same for all log lines that pertain to a particular operation
  • Line increments from 1 for each OperationID (hence the combination of OperationID and Line is unique; I don't know why the original table designer added a GUID PK)
  • The message varies considerably
  • A nonclustered nonunique index is present on OperationID. It INCLUDEs no other columns (see the DDL sketch just after this list)
  • The table has about 13.5 million rows
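
Based on the two lists above, the table and index presumably look something like this. This is a hypothetical reconstruction: names like IX_logs_OperationID are mine, and the exact types and nullability may differ:

CREATE TABLE dbo.logs
(
    RowID       UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,  -- clustered by default
    OperationID UNIQUEIDENTIFIER NOT NULL,
    Line        INT NOT NULL,
    Message     VARCHAR(250) NULL
);

-- Nonclustered, nonunique, no INCLUDEd columns
CREATE NONCLUSTERED INDEX IX_logs_OperationID
    ON dbo.logs (OperationID);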

I find the following query reasonably performant: it returns 7,000 rows in 5 seconds or less. The query plan shows an index seek for all rows matching the OperationID, a key lookup on the primary key, and then a nested loops join to produce the result set:

SELECT *  
FROM logs 
WHERE operationid = <some guid>

Side info: SQL Server Management Studio complains that the query would be better served if an index were created on OperationID that INCLUDEd Line and Message.
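
For reference, that suggestion corresponds to something like the following (the index name is my own invention; SSMS generates its own):

CREATE NONCLUSTERED INDEX IX_logs_OperationID_incl
    ON dbo.logs (OperationID)
    INCLUDE (Line, Message);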

If I add something that I consider trivial:

SELECT * 
FROM logs 
WHERE operationid = <some guid> AND Line > 2000

Suddenly the query takes well over 3 minutes to return a result. My immediate thought was to look at the differences between the execution plans. The only significant difference I can find is that, in addition to the "seek predicate" that associates the table rows with index rows, the non-performant query's plan shows a "predicate" section in the operator's tooltip. The "predicate" is listed as "Line > 2000".
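
If you want numbers rather than tooltips, comparing the logical reads and CPU time of the two variants makes the difference concrete. A minimal sketch, where @op and the zero GUID are placeholders for a real OperationID:

DECLARE @op UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM logs WHERE OperationID = @op;                  -- fast variant
SELECT * FROM logs WHERE OperationID = @op AND Line > 2000;  -- slow variant

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;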

Side info: SQL Server has switched to complaining that the query would be better served by indexing OperationID and Line and INCLUDEing Message.
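
Again for reference, that second suggestion corresponds to something like this (hypothetical name):

CREATE NONCLUSTERED INDEX IX_logs_OperationID_Line
    ON dbo.logs (OperationID, Line)
    INCLUDE (Message);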

Is SQL Server doing things in the wrong order? That is, is it looking through all 13.5 million rows for those whose Line > 2000, and then filtering and joining to the index?

As a human I can see this query should be sub-5-second, because filtering the rows on the indexed column first (13.5M -> 7K) and then filtering those 7K rows on Line would be a better strategy than finding the potentially millions of rows with Line > 2000. I was expecting that filtering a large table on an indexed column is something SQL Server would always prefer over starting with the non-indexed columns in a WHERE clause?

I don't really need an index on Line, as we don't commonly select entries based on line ranges; this is just for code development, and I know that for the particular operation I'm analysing now, all line entries before 2000 are uninteresting. I'm more interested in why a query with WHERE indexedcol = value is super fast, but WHERE indexedcol = value AND unindexedcol = othervalue runs like a dog. And by that I don't mean a greyhound.

Best Answer

You're getting a secondary predicate on the key lookup operation. That's going to slow things down. It's not doing a full scan, but remember that even seeks are just limited scans. So the key lookup is still doing what it did before, but now it also has a filtering step to perform because of that added predicate on Line.
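
To make that plan shape concrete, here is a rough hand-written equivalent of what the operators are doing. This is only an illustration of the logical steps, not how the engine literally executes it (@op is a placeholder GUID):

DECLARE @op UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';

SELECT l.*
FROM (SELECT RowID
      FROM dbo.logs
      WHERE OperationID = @op) AS seek  -- index seek: gather the ~7,000 matching keys
JOIN dbo.logs AS l
    ON l.RowID = seek.RowID             -- key lookup into the clustered index, row by row
WHERE l.Line > 2000;                    -- residual predicate, tested on each fetched row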

You could look at a covering index. However, because you have a SELECT *, covering means including all the columns, effectively creating a second copy of the clustered index. There'll be some sacrifice of performance on data modification queries and of disk space (depending on where you are on Azure SQL Database, that could bump you to another tier).

Instead, I'd start by experimenting with your existing nonclustered index. See whether adding the new column, Line, to the key of that index helps at all. Keep it as the second column; otherwise you're changing the histogram, which could cause other issues.
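
Something like this keeps OperationID as the leading column while adding Line to the key. I'm assuming the existing index is named IX_logs_OperationID, so substitute the real name:

CREATE NONCLUSTERED INDEX IX_logs_OperationID
    ON dbo.logs (OperationID, Line)
    WITH (DROP_EXISTING = ON);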

You're going to have to experiment to see for sure what works best in your situation.