Well, the issue is now resolved:
While it would seem logical to use filtered indexes (WHERE ... IS NOT NULL) to reduce database size and, as so many sources on the web claim, increase performance, the reality seems to be something else entirely.
In layman's terms, the SQL Server query planner resolves even basic inner joins without making any assumptions about the content of the columns. Even though NULL values never satisfy a join, they must be included in the column's index for the query planner to use it, unless the query itself excludes them with a predicate such as WHERE joinCol_ID IS NOT NULL. In short, SQL Server does not use filtered indexes for joins at all unless the queries themselves are modified to account for the filter condition. Instead, it will create new statistics on those columns and/or use a clustered index scan or another index that includes the column, whichever it deems most effective. Using filtered indexes on foreign key columns is therefore an absolutely horrid idea.
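A minimal sketch of the behavior described above (the table and column names are made up for illustration; only the pattern matters):

```sql
-- Hypothetical schema: Orders.Customer_ID is a nullable foreign key.
CREATE NONCLUSTERED INDEX IX_Orders_Customer_ID_Filtered
    ON dbo.Orders (Customer_ID)
    WHERE Customer_ID IS NOT NULL;

-- The optimizer will generally NOT use the filtered index here,
-- even though NULLs can never match the join:
SELECT o.Order_ID, c.Name
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c
    ON c.Customer_ID = o.Customer_ID;

-- Adding an explicit predicate that covers the index filter makes
-- the filtered index usable again:
SELECT o.Order_ID, c.Name
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c
    ON c.Customer_ID = o.Customer_ID
WHERE o.Customer_ID IS NOT NULL;
```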
We still have no idea why months' worth of testing in multiple other environments never reproduced the result we saw in this one, single DB, but this is how it's supposed to work. Apparently something that, as far as we know, is unrelated to cache, statistics, or configuration caused the production DB to behave differently and correctly detect and use the filtered indexes, while all of the testing environments simply kept using the old indexes (since the indexes were dropped and recreated under the same names, this seems a plausible theory even if there is no real proof).
So the lesson of the story: the web is filled with examples of how underused filtered indexes are and how awesome they can be. But this serious downside never popped up, except as a nagging thought in the back of my head saying "if these are so great, then why aren't NULL values filtered out of indexes by default, since they only take up space and only serve a purpose in special circumstances?" Well, now I know why. :)
The first thing I would suggest you check is your "max degree of parallelism" via sp_configure. By default this is set to 0, meaning a parallel query will consume all CPUs on the box (up to 64 threads) until it finishes. I've seen this occur on more systems than I care to admit over my career, and it tends to be even more frustrating on larger servers (with a higher number of CPUs). If this is still at the default of 0, Microsoft's recommendation for what it should be can be found here: http://support.microsoft.com/kb/2806535
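Checking and changing the setting looks like this (the value 8 below is just an example; pick a value based on the guidance in that KB article):

```sql
-- 'max degree of parallelism' is an advanced option, so enable
-- advanced options first to see and change it:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Show the current value:
EXEC sp_configure 'max degree of parallelism';

-- Example only: cap parallelism at 8 CPUs.
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```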
If it's a smaller server with very few CPUs (e.g. 1 or 2), it could be something as trivial as auto-update statistics kicking in, though that has reportedly been rare... but rare still happens. You'd have to check via a profiler trace to be certain that is the culprit, though.
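Before setting up a trace, a quick sanity check of the statistics settings and how stale your statistics are can narrow things down; a sketch:

```sql
-- Is auto-update statistics on (and is it synchronous)?
SELECT name,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases;

-- In the current database: when was each statistic last updated?
SELECT OBJECT_NAME(s.object_id)             AS table_name,
       s.name                               AS stats_name,
       STATS_DATE(s.object_id, s.stats_id)  AS last_updated
FROM sys.stats AS s
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY last_updated;
```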
Best Answer
Both missing indexes and long-running queries are visible in DMVs, with the caveat that DMV data can be cleared by certain events, such as a service restart, some sp_configure changes, etc.
Here's a missing index query from Bart Duncan's blog post, Are you using SQL's Missing Index DMVs?:
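The exact query from that post isn't reproduced here, but a sketch along the same lines, combining the three missing-index DMVs into a rough "improvement" score, looks like this:

```sql
-- Rank missing-index suggestions by a rough impact estimate.
-- The scoring formula is an illustration, not gospel.
SELECT TOP (25)
       mid.statement                 AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_user_impact,
       migs.user_seeks * migs.avg_total_user_cost
           * (migs.avg_user_impact / 100.0) AS improvement_measure
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mig.index_handle = mid.index_handle
ORDER BY improvement_measure DESC;
```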
I would change about 20 things in that query, but it's a start. This will show you the indexes you should investigate creating - but please don't just create all of the indexes suggested. I've blogged about why you don't want to do that.
And here is a query from the sys.dm_exec_query_stats documentation. Also some bad habits there, but shrug. This will collect information about your longest-running queries in aggregate. If you want to know about individual instances of long-running queries, you'll have to capture those yourself using a server-side trace with a duration filter, extended events, or a third-party monitoring product (I'm biased, because I worked at one for nearly a decade).
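A sketch in the same spirit as the documented example, returning the longest-running statements by total elapsed time since their plans were cached:

```sql
-- Top 10 statements by total elapsed time (microseconds in the DMV,
-- converted to milliseconds here).
SELECT TOP (10)
       qs.total_elapsed_time / 1000.0                      AS total_elapsed_ms,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2) + 1)    AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
```

Remember the caveat above: these numbers reset whenever a plan leaves the cache, so they reflect activity since caching, not all time.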