Ok, at this point I would be considering "manually" replacing statistics to tide me over. I haven't done this myself, but it should work in theory...
You can confirm that this would work as follows:
1. Restore the pre-upgrade database to your LIVE system and confirm that the query is fast in that database, and delete that database once done. (This test is to eliminate any additional environmental differences as the cause. If the query is slow, then the rest of this proposal may be useless.)
1a. Restore the pre-upgrade database to your test system and confirm that the query is fast.
2. Update statistics, and confirm that the query is slow (cancel after a couple minutes of course)
3. Set compat to 120 and confirm that the query is still slow
4. Set compat back to 100 and confirm that the query is still slow
5. Restore another copy of the pre-upgrade database (I'll refer to this as Rest2, and the earlier as Rest1)
6. Extract all statistics from Rest2 using the techniques at https://thomaslarock.com/2015/07/how-to-recreate-sql-server-statistics-in-a-different-environment/ for all tables involved in the problematic queries (or all tables if that's simpler)
7. Apply the statistics to Rest1, and see if query is now fast (you may need to dbcc freeproccache first).
If it works, then it's almost certainly safe to apply the statistics to your live database--just make sure you have ONLY statistics scripted. And also set its compat level to 100. You should then see the queries run fast (though you may need to dbcc freeproccache first--but consider the possible effect on live performance).
Note I'm assuming here (based on your original post) that you do not have statistics autoupdate turned on, and that your data changes slowly enough that your old statistics will do until you figure out how to get your workload working with compat 120 and/or updated statistics (and you may as well sort both at this point).
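A rough sketch of steps 6 and 7, using the (undocumented but widely used) STATS_STREAM option that the linked article relies on. The table and statistic names here are placeholders, not from your schema:

```sql
-- On Rest2: generate the binary statistics blob for one statistic.
DBCC SHOW_STATISTICS ('dbo.MyTable', IX_MyTable_Col1) WITH STATS_STREAM;

-- On Rest1: paste the stream value from the output above into an
-- UPDATE STATISTICS call. The hex literal below is a truncated placeholder.
UPDATE STATISTICS dbo.MyTable (IX_MyTable_Col1)
WITH STATS_STREAM = 0x010000000300...;
```

You would repeat this per statistic (the article shows how to generate these statements for every statistic on a table).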
I understand your disappointment with the query plan regressions that you experienced. However, Microsoft changed some core assumptions about the cardinality estimator. They could not avoid some query plan regression. To quote Juergen Thomas:
However to state it pretty clearly as well, it was NOT a goal to avoid any regressions compared to the existing CE. The new SQL Server 2014 CE is NOT integrated following the principles of QFEs. This means our expectation is that the new SQL Server 2014 CE will create better plans for many queries, especially complex queries, but will also result in worse plans for some queries than the old CE resulted in.
To answer your first question, the optimizer appears to pick a worse plan with the new CE because of a 1 row cardinality estimate from Object2. This makes a nested loop join very attractive to the optimizer. However, the actual number of rows returned from Object2 was 34182. This means that the estimated cost for the nested loop plan was an underestimate by about 30000X.
The legacy CE gives a 208.733 cardinality estimate from Object2. This is still very far off, but it's enough to give a plan that uses a merge join a lower estimated cost than a nested loop join plan. SQL Server gave the nonclustered index seek on Object3 a cost of 0.0032831. With a nested loop plan under the legacy CE, we could expect a total cost for 208 index seeks to be about 0.0032831 * 208.733 = 0.68529 which is much higher than the final estimated subtree cost for the merge join plan, 0.0171922.
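The cost comparison above can be reproduced directly (the costs come from the plans quoted earlier; the optimizer's internal math is more involved, so this is just the back-of-envelope version):

```sql
-- Per-seek cost * legacy CE row estimate vs. the merge join subtree cost.
SELECT 0.0032831 * 208.733 AS est_nested_loop_cost,  -- ~0.68529
       0.0171922           AS est_merge_join_cost;   -- much cheaper, so merge wins
```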
To answer your second question, the cardinality estimate formulas for a query as simple as yours are actually published by Microsoft. I recommend referencing the excellent white paper on differences between the legacy and new CE found here. Focus on why the cardinality estimates are 1 for the new CE and 208.733 for the legacy CE. That's unexpected because the legacy CE assumes independence of filters but the new CE uses exponential backoff. In general for such a query I would expect the new CE to give a larger cardinality estimate for Object2. You should be able to figure out what's going on by looking at the statistics on Object2.
To answer your third question, we can get general strategies from the white paper. The following is an abbreviated quote:
- Retain the new CE setting if specific queries still benefit, and “design around” performance issues using alternative methods.
- Retain the new CE, and use trace flag 9481 for those queries that had performance degradations directly caused by the new CE.
- Revert to an older database compatibility level, and use trace flag 2312 for queries that had performance improvements using the new CE.
- Use fundamental cardinality estimate skew troubleshooting methods.
- Revert to the legacy CE entirely.
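For the per-query trace flag options above, the usual mechanism is QUERYTRACEON (the query text and object names below are placeholders):

```sql
-- Option 2: keep the new CE database-wide, force the legacy CE
-- for one regressed query.
SELECT o2.Col1
FROM dbo.Object2 AS o2
JOIN dbo.Object3 AS o3 ON o3.Id = o2.Id
OPTION (QUERYTRACEON 9481);  -- legacy (pre-2014) CE for this query only

-- Option 3: keep the old compat level, opt one query in to the new CE.
-- OPTION (QUERYTRACEON 2312);
```

Note that QUERYTRACEON requires sysadmin permissions unless it's applied through a plan guide.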
For your problem in particular, first I would focus on the statistics. It's not clear to me why the cardinality estimate for an index scan on Object3 would be so far off. I recommend updating statistics with FULLSCAN on all of the involved objects and indexes before doing more tests. Updating the statistics again after changing the CE is also a good step. You should be able to use the white paper to figure out exactly why you're seeing the cardinality estimates that you're seeing.
I can give you more detailed help if you provide more information. I understand wanting to protect your IP but what you have there is a pretty simple query. Can you change the table and column names and provide the exact query text, relevant table DDL, index DDL, and information about the statistics?
If all else fails and you need to fix it without hints or trace flags, you could try updating to SQL Server 2016 or changing the indexes on your tables. You're unlikely to get the bad nested loop plan if you remove the index, but of course removing an index could affect other queries in a negative way.
Best Answer
Check out this part of the estimated plan for the slow query (second server):
This estimates that 21 rows will come out of the join to RatesByCategory (which is after _Results323 has been joined to AttributesTable twice, and to DataStreamDirection).

If we look at the equivalent part of the actual plan for the fast query, there will actually be ~16,000,000 rows coming out of that join. This is bad news, because those millions of rows flow right into a wall of Nested Loops joins, which is likely where that query slows to a crawl.
You were on the right track looking at estimates and trying to update stats, but this actually seems to be because of a row goal set by the semi-joins later in the execution plan.
Here's where the semi-join lives, and notice that everything "under it" has an estimate of 1 row:
Try adding this hint to the end of the query:
This will disable the row goal and should give you a different plan (possibly similar to the one on the lower compatibility level).
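The hint itself isn't reproduced above. Assuming SQL Server 2016 SP1 or later, disabling the row goal at the query level looks like this:

```sql
SELECT ...  -- your generated query text
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));

-- On earlier builds, trace flag 4138 is the equivalent:
-- OPTION (QUERYTRACEON 4138);
```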
Since you said this query is generated, your best bet for solving this might be to create a plan guide so that you can add that hint without having to change the application that produces the query.
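A plan guide for this would look roughly like the following; note that @stmt must match the application's generated statement text exactly (the text below is a placeholder), and the USE HINT syntax assumes SQL Server 2016 SP1 or later:

```sql
EXEC sp_create_plan_guide
    @name            = N'PG_DisableRowGoal',
    @stmt            = N'SELECT ... FROM dbo.RatesByCategory ...',  -- exact query text
    @type            = N'SQL',
    @module_or_batch = NULL,
    @params          = NULL,
    @hints           = N'OPTION (USE HINT (''DISABLE_OPTIMIZER_ROWGOAL''))';
```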
Changing the compat level to 120 would have enabled the "new" cardinality estimator, resulting in a different query plan, likely one less affected by the row goal (which is why you saw it finish in 40 minutes, while the compat 110 query ran for 50 hours without finishing).