I'm not aware of any documentation. I did look into this and make some observations however that are too long for a comment.
The 10% estimate is not always a degradation. Take the following example.
TRUNCATE TABLE dbo.StringTest
INSERT INTO dbo.StringTest
SELECT TOP (1000000) 'ZZ_' + LEFT(NEWID(), 12)
FROM master..spt_values v1,
master..spt_values v2;
and the WHERE
clause in your question.
WHERE SUBSTRING(TheString, 1, CHARINDEX('_',TheString) - 1) = 'ZZ'
The table contains a million rows. All of them match the predicate. Under compat level 130 the 10% guess yields an estimate of 100,000. Under 120 the estimated rows is 1.03913.
The 120 behaviour uses the histogram but only to get the number of distinct rows. The density vector in my case shows 1.039131E-06 and this is multiplied by the table cardinality to get the estimated row count. All of the values are in fact different but all match the predicate.
Tracing the query_optimizer_estimate_cardinality
extended event shows that under 130 there are two different <StatsCollection Name="CStCollFilter"
events. The first one estimates 100,000. The second one loads the histogram and uses the CSelCalcPointPredsFreqBased/DistinctCountCalculator to get the 1.04 estimate. This second result appears unused.
The behavior that you observed is not consistently applied in 130. I added ORDER BY TheString
expecting this to be a clear win for the 130 estimator as the 120 struggles on with a memory grant for one row but this minor change was sufficient to bring the estimated rows down to 1.03913 in the 130 case too.
Adding OPTION (QUERYRULEOFF SelectToFilter)
reverts the estimate going into the sort to 100,000 but the memory grant doesn't increase and the estimates coming out the sort are still based on the table distinct values.
Similarly tweaking the cost threshold for parallelism so that the query gets a parallel plan was sufficient in the 130 case to revert to the lower estimate. Adding QUERYTRACEON 8757
also causes the lower estimate. It looks like the 10% estimate is only retained for trivial plans.
Your proposed rewrite with
WHERE TheString LIKE 'ZZ[_]%'
Shows much superior estimates to both. The output for this is
CSelCalcTrieBased
Column: QCOL: [MyStringTestDB].[dbo].[StringTest].TheString
Showing that it used tries. More info about this is in the string summary statistics section just above here.
It is not the same as your original query however. As the first instance of _
is now assumed to always be the third character rather than being found dynamically.
If this assumption is hardcoded into your original query
WHERE SUBSTRING(TheString, 1, 3) = 'ZZ_'
The estimation method changes to CSelCalcHistogramComparison(INTERVAL)
and the estimated rows become accurate.
It is able to convert that into a range
WHERE TheString >= 'ZZ_' AND TheString < ???
and use the histogram to estimate the number of rows with values in that range.
This applies only to the cardinality estimation however. LIKE
is preferable as it can use a range seek at runtime. SUBSTRING(TheString, 1, 3)
or LEFT(TheString, 3)
can't.
Best Answer
It will not behave exactly the same. Compatibility mode works at the database level, not the Instance level, which is still 2017. But it will mostly work the same.
I would check the breaking feature list for 2017 as some Instance level changes may still affect your code, despite it being in 2016 compatibility mode. However, the breaking changes from 2017 are relatively minor so it is probably unlikely you'd be affected. In general, some version breaking changes are covered under compatibility mode and some are not.
The documentation gives good examples of this,
and
You should also look at the deprecated feature list for 2017 and try to remove any of those areas from future development in order to future-proof your applications. And of course, any new features from 2017 may require updates to take advantage of them as well, but I get the impression you're more concerned about breaking changes.
With all that said, you're probably fine, but should still carefully and thoroughly test upgrading the compatibility mode before moving it to production.