SQL Server 2008 query planner failing after index drop & recreate

execution-plan, filtered-index, index, sql-server, sql-server-2008

Recently we ran a script against our production DB that dynamically dropped and recreated hundreds of indexes as filtered indexes. While this script had run perfectly in all previous tests, after this run the SQL Server query plan cache is behaving oddly.
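
As a minimal sketch of the kind of conversion the script performed (the table, index and column names here are hypothetical, not our actual schema):

    -- Drop an existing nonclustered index and recreate it as a filtered index
    DROP INDEX IX_Orders_CustomerID ON dbo.Orders;

    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        WHERE CustomerID IS NOT NULL;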

The execution plan reveals that the wrong indexes are being used, returning millions of rows when only a few should match. Even when the correct indexes are chosen, the execution plan shows SQL Server resolving to an index Scan as opposed to an index Seek, giving far from optimal results. With table hints like WITH (INDEX(indexname)) or WITH (FORCESEEK) in the proper places this can be corrected, and an INNER LOOP JOIN hint also fixes some of these.
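
For illustration, a minimal sketch of those hint-based workarounds, assuming a hypothetical Orders/Customers schema:

    -- Force a specific index and a nested loops join (object names are made up):
    SELECT o.OrderID, c.Name
    FROM dbo.Orders AS o WITH (INDEX(IX_Orders_CustomerID))
    INNER LOOP JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID;

    -- Or force a seek instead of a scan on a single table:
    SELECT o.OrderID
    FROM dbo.Orders AS o WITH (FORCESEEK)
    WHERE o.CustomerID = 42;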

However, the problem is that even when these new filtered indexes are dropped and recreated as they were before, the query plans remain the same. The plan cache has been cleared, the DB has been restored to a different environment, statistics have been updated, and obviously the indexes have been rebuilt so they're not fragmented.
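
Roughly, the remediation attempted so far amounts to the following (table and index names again hypothetical):

    DBCC FREEPROCCACHE;                                       -- clear the plan cache
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;                -- refresh statistics
    ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD;    -- remove fragmentation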

This is currently a critical issue that nobody has any idea how to fix. While we can force SQL Server to use the correct plans, that is simply not a workable solution given the multitude of software that would have to be updated with hints, and obviously a DB where you have to manually point out how to handle queries is not an option.

So any help would be greatly appreciated.

Edit:
We managed to fix one query by dropping the indexes, recreating them again as filtered, then running UPDATE STATISTICS tablename WITH FULLSCAN. This fixed part of the problem, and two of the joins started working correctly. After this we had to separately alter a multi-column index that was in no way involved in the original index script, so that it included one of the columns used in the join. These two changes together enabled the query planner to use the correct indexes with a Seek rather than a Scan.
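
In rough terms, the fix looked like this (all object names are placeholders):

    -- 1) Recreate the filtered index and rebuild statistics with a full scan
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        WHERE CustomerID IS NOT NULL;

    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

    -- 2) Widen an unrelated multi-column index so it covers the join column
    CREATE NONCLUSTERED INDEX IX_Orders_Status_Date
        ON dbo.Orders (Status, OrderDate)
        INCLUDE (CustomerID)
        WITH (DROP_EXISTING = ON);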

The theory right now is that due to an earlier crash, the DB was apparently restored by deleting the underlying DB and then creating a new one from backup, rather than just restoring WITH REPLACE as before. This would have somehow disconnected the master database's metadata, such as execution plans and all cached information, from the DB, resulting in a new, massive database with no existing plans to handle queries. This, combined with an apparently failed statistics update on the newly created indexes, would have produced a scenario where SQL Server had no idea how to resolve the various queries it was being bombarded with.
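
For clarity, the two restore approaches in question would look something like this (database name and backup path are purely illustrative):

    -- What had been done before: restore over the existing database
    RESTORE DATABASE ProdDB
        FROM DISK = N'X:\Backups\ProdDB.bak'
        WITH REPLACE;

    -- What was apparently done after the crash: drop, then restore as a new database
    DROP DATABASE ProdDB;
    RESTORE DATABASE ProdDB
        FROM DISK = N'X:\Backups\ProdDB.bak';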

I'm not convinced, however, that this will be enough, since some of the behavior, such as having to alter the multi-column index, and the fact that no statistics update was ever necessary in any of the prior testing environments over the past 2 months of testing, suggests that something else has gone wrong.

Best Answer

Well, the issue is now resolved:

While it would seem logical to use filtered indexes (filtered on NOT NULL) to reduce database size and, as so many sources on the web claim, increase performance, the reality, it seems, is something else entirely.

In layman's terms, the SQL Server query planner resolves even your basic inner joins without making any assumptions about the contents of the columns. Even though NULL values never match in a join, they must be included in the column's index in order for the query planner to use it, unless the query says otherwise with a predicate such as WHERE joinCol_ID IS NOT NULL. Basically, SQL Server does not use filtered indexes for joins at all unless the queries themselves are modified to account for the filter condition. Instead, it will create new statistics on these columns and/or use a clustered index scan or other indexes that include the column, whichever it deems most effective. Using filtered indexes on foreign keys is therefore an absolutely horrid idea.
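
A minimal sketch of the behavior, assuming a hypothetical Orders/Customers schema:

    -- The filtered index excludes NULLs from the foreign-key column...
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        WHERE CustomerID IS NOT NULL;

    -- ...so a plain join will generally not use it:
    SELECT o.OrderID, c.Name
    FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

    -- whereas repeating the filter predicate in the query lets the optimizer match it:
    SELECT o.OrderID, c.Name
    FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
    WHERE o.CustomerID IS NOT NULL;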

We still have no idea how months' worth of testing this in multiple other environments never produced the same results outside of this one single DB, but this is the way it's supposed to work. Apparently something that, as far as we know, is not related to cache, statistics or configuration caused the production DB to behave differently and correctly detect and use the filtered indexes, while all of the testing environments simply kept using the old indexes (seeing as the indexes were dropped and recreated with the same names, this seems a valid theory even if there is no real proof).

So the moral of the story: the web is filled with examples of how underused filtered indexes are and how awesome they can be. But this serious downside never popped up, except as a nagging thought in the back of my head saying, "if these are so great, then why aren't NULL values filtered out of indexes by default, since they only take up space and only serve a purpose in special circumstances?" Well, now I know why. :)