SQL Server – filtered index IS NOT NULL and > 0 disparity — need explanation

Tags: performance, sql-server-2008

I'm getting behavior in a query plan I cannot explain. The difference is between two filtered indexes I'm testing with: one uses where sometable_id is not null and the other uses where sometable_id > 0. In my actual data I see a 95% more favorable runtime using the > 0 index. I can't see why they would differ… my keys are auto-incrementing integers starting at 1, referenced by the adjoining table on a nullable column. Below is a script that generates structures and data analogous to my production data.

Please note the following: if you execute these scripts and then compare the two select statements, you will probably see roughly fifty-fifty performance, as I did. In my production data, however, the query that utilizes the > 0 filtered index chooses an index scan instead of a seek, and this scan runs much faster. Below is a screenshot of my actual query plan comparison. Secondly, I have rebuilt these indexes; fragmentation is not an issue.

Questions:
Why the disparity? Where does the scan vs seek come from? Wouldn't is not null and > 0 be equivalent in a join when the datatype is int identity(1,1)?

[screenshot: query plan comparison from production]

Schema + Data:

if exists (select * from sys.tables where name = 'sometable')
    drop table sometable;
go

create table sometable (id int not null identity(1,1) primary key
                      , value nvarchar(50));
go

insert into sometable values ('a test value');
insert into sometable values ('a test value');
insert into sometable values ('a test value');
insert into sometable values ('a test value');
insert into sometable values ('a test value');
go

if exists (select * from sys.tables where name = 'audit')
    drop table audit;
go

create table audit (id int not null identity(1,1) primary key
                  , sometable_id int null
                  , someothertable_id int null
                  , auditvalue nvarchar(50));
go

-- generate roughly 640,000 audit rows: a mix of audits that reference
-- sometable (sometable_id populated) and unrelated audits (sometable_id null)
declare @count int = 0;
while (@count < 40000)
begin
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'a sometable audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'another sometable audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,1,'irrelevant other table audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'another audit for record one sometable');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'record three audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'another record 3 audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,50,'irrelevant1');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,51,'irrelevant2');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,52,'irrelevant3');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'relevant fourth record');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'back to one record');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,53,'irrelevant 4');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,54,'irrelevant fifth record');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,55,'irrelevant sixth record');
insert into audit (sometable_id,someothertable_id,auditvalue) values (null,56,'irrelevant seventh record');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'a fifth record audit');
insert into audit (sometable_id,someothertable_id,auditvalue) values (floor(rand()*5+1),null,'another fourth record audit');
set @count = (@count + 1);
end
go

--drop index audit.filter_null_audits_for_sometable
create index filter_null_audits_for_sometable on audit(sometable_id,id) include(auditvalue) where sometable_id is not null;
go
--drop index audit.filter_upzero_audits_for_sometable
create index filter_upzero_audits_for_sometable on audit(sometable_id,id) include(auditvalue) where sometable_id > 0;
go

Two Queries:

select top 50000 a.id sometableid,a.value,b.id auditid,b.auditvalue
  from sometable a
  join audit b with(index(filter_null_audits_for_sometable)) on a.id = b.sometable_id

select top 50000 a.id sometableid,a.value,b.id auditid,b.auditvalue
  from sometable a
  join audit b with(index(filter_upzero_audits_for_sometable)) on a.id = b.sometable_id and b.sometable_id > 0

Update1

I copied production data into my test tables. Instead of fifty-fifty, the results matched the included query plan and reproduced the disparity, so the test scenario is structurally analogous.

Update2

select a.id sometableid,a.value,b.id auditid,b.auditvalue
  from sometable a
 inner merge join audit b with(index(filter_null_audits_for_sometable)) 
    on a.id = b.sometable_id

select a.id sometableid,a.value,b.id auditid,b.auditvalue
  from sometable a
  join audit b with(index(filter_upzero_audits_for_sometable)) 
    on a.id = b.sometable_id

Neither of these queries will compile. Why? I'm forced to use > 0 as a join condition to get the optimized plan.

Best Answer

There are a bunch of questions hidden in here.

Let's start with why the logically equivalent > 0 yields a scan instead of a seek.

The index is filtered on IS NOT NULL, so SQL Server knows that no value in this index is NULL. However, it does not know that all values are > 0, because an IDENTITY property is not a constraint. So, to return all rows > 0, SQL Server has to find the first row that actually is > 0 (using a seek operation) and then scan the remaining rows from there. If that scan covers most of the table, spending the additional reads for the initial seek might be more expensive than just scanning the full table. Both approaches lead to the same result in all cases; which one SQL Server picks is based on statistics.
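
A quick way to convince yourself of that (a sketch against the sample table above; the specific values are arbitrary): both of the following succeed, so the optimizer cannot assume id > 0.

SET IDENTITY_INSERT sometable ON;
-- a negative id is perfectly legal despite identity(1,1)
INSERT INTO sometable (id, value) VALUES (-1, 'negative id');
SET IDENTITY_INSERT sometable OFF;

-- after this reseed, even the generated ids would start below zero
DBCC CHECKIDENT ('sometable', RESEED, -100);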

So, why is the scan faster?

SQL Server has three physical join operators to choose from: the Nested Loops Join, the Hash Join and the Merge Join. The Merge Join is by far the fastest; however, it requires that both inputs are sorted the same way. For more details on the join operators check out my JOIN series here: http://sqlity.net/en/1146/a-join-a-day-introduction/ The seek at the bottom of the Nested Loops Join operator gets executed once for each row of the top input. That is a lot of seeks, and they can become costly quickly. SQL Server usually figures out where the so-called "tipping point" lies, at which a full table scan is cheaper than the repeated seeks. However, that decision is again based on statistics and is often off by a few hundred rows.
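
To see the three operators side by side on the sample tables, you can force each one with a query hint; a sketch for experimentation only, not something to keep in production code:

SELECT a.id, b.id
FROM    sometable a
JOIN    audit b ON a.id = b.sometable_id
OPTION  (LOOP JOIN);   -- nested loops: repeated seeks on the inner input

SELECT a.id, b.id
FROM    sometable a
JOIN    audit b ON a.id = b.sometable_id
OPTION  (HASH JOIN);   -- hash match: builds a hash table on one input

SELECT a.id, b.id
FROM    sometable a
JOIN    audit b ON a.id = b.sometable_id
OPTION  (MERGE JOIN);  -- merge: requires both inputs sorted on the join key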

If the Merge join is faster and the data is already sorted, why does SQL Server not use it for both queries?

That is a good question... Usually bad decisions by the optimizer are caused by stale statistics (on any table involved in the query). However, the optimizer has limitations, and dealing with advanced features like filtered indexes might force it out of its comfort zone. If updating statistics does not help, we need to help the optimizer. There are several ways that can be done. You could flat out provide a MERGE JOIN hint. For this query that might even be safe, as the query will always have to look at all rows in the child table. However, in general, hints should be considered an absolute last resort, as they prevent the optimizer from adapting to changing data. You could also try to rewrite the query in a way that gets the optimizer started on a different path leading to a better result. That is comparable to hinting, only hidden, and usually an even worse strategy. However, if you think back to pre-2005 times, it was very common to use query rewrites just to get the optimizer to end up with a better plan. The advantage of this approach is that the next version of the optimizer might recognize the rewritten query as equivalent to the original, so this technique does not have long-term effects. (That could also be bad...)
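
Updating statistics is the cheap first step. A minimal sketch (FULLSCAN is the thorough but slower option):

-- refresh statistics on every table involved in the query
UPDATE STATISTICS audit WITH FULLSCAN;
UPDATE STATISTICS sometable WITH FULLSCAN;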

Another approach altogether is to provide additional information to the optimizer. If you have complex queries, the detail of the statistics might just not be fine-grained enough for the optimizer to make a good choice. In that case it often helps to create a filtered index or a filtered statistics object. In your case we are looking at a filtered index already. What might be missing is the information that all rows are actually > 0, so it might help to add both conditions to the index filter, as sketched below.
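
A minimal sketch of that combined filter; the index name here is made up:

CREATE INDEX filter_combined_audits_for_sometable
ON      audit(sometable_id, id) INCLUDE(auditvalue)
WHERE   sometable_id IS NOT NULL AND sometable_id > 0;

With both conditions in the filter, the index definition itself tells the optimizer that every row in it is > 0.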

While all of this does not give you a real solution, I hope it helps you understand what is going on here. To solve it, I would start by adding that second condition to the same index and see where that leads you.


On a different note: you mention a 50-50 comparison of two execution plans. I assume you are referring to the "Query cost (relative to batch)" value. While SQL Server displays this information in a very suggestive way, you can never rely on it. This number is helpful to SQL Server internally when comparing different execution plans for the same query; it is absolutely useless when comparing two non-identical queries. It is best to just ignore this value when talking about performance. Use the information you get back from STATISTICS IO and STATISTICS TIME instead.
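
For example, wrapping the two original queries like this returns actual reads and CPU/elapsed times instead of estimated costs:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP 50000 a.id sometableid, a.value, b.id auditid, b.auditvalue
FROM    sometable a
JOIN    audit b WITH(INDEX(filter_null_audits_for_sometable))
        ON a.id = b.sometable_id;

SELECT TOP 50000 a.id sometableid, a.value, b.id auditid, b.auditvalue
FROM    sometable a
JOIN    audit b WITH(INDEX(filter_upzero_audits_for_sometable))
        ON a.id = b.sometable_id AND b.sometable_id > 0;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;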


UPDATE to address your questions:

1) To retrieve the rows for any query, SQL Server can always execute a scan. In the right circumstances a seek can be much faster. I was describing what SQL Server would have to do to execute a seek on the audit table to get the rows > 0. If SQL Server thinks that would result in more reads than a scan (as in this case), it opts for the scan.

2) To make the merge join hint work, you also need to tell the optimizer that there actually can't be a value <= 0 in the table. In your example you could do that like this:

ALTER TABLE sometable ADD CONSTRAINT sometable_id_upzero CHECK  (id>0);
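
Note that the sample script in the question never actually creates the foreign key it describes; if you are reproducing this, something like the following would add it (the constraint name is made up):

ALTER TABLE audit ADD CONSTRAINT fk_audit_sometable
    FOREIGN KEY (sometable_id) REFERENCES sometable(id);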

Because of the existing foreign key this constraint applies to both tables. If you now execute

SELECT TOP 50000
        a.id sometableid,
        a.value,
        b.id auditid,
        b.auditvalue
FROM    sometable a
INNER MERGE JOIN audit b
        ON a.id = b.sometable_id;

SQL Server will use this execution plan:

[screenshot: execution plan using the merge join]

That is the plan we are aiming for. It still uses the filter_null_audits_for_sometable index, but it now allows for the merge join. I am not sure why this constraint should change whether the MERGE join can be used; it seems to be an optimizer limitation. Again, filtered indexes are a fairly advanced option for the optimizer, and the implementation might not be complete yet.

3) The advice about not trusting the relative query cost is general in nature and does not imply that the numbers are actually wrong for this particular query. I can easily construct two queries where one shows a relative cost of 99% but the other one actually runs a hundred times slower. If you have an example at hand where the relative costs happen to be correct, that just means you got lucky. :)

Hope that clarified things.