SQL Server – Cardinality Estimate for LIKE Operator (Local Variables)

cardinality-estimates | optimization | sql-server | sql-server-2014 | statistics

I was under the impression that when using the LIKE operator in all "optimize for unknown" scenarios, both the legacy and new CEs use a 9% estimate (assuming that relevant statistics are available and the query optimiser doesn't have to resort to selectivity guesses).

When executing the queries below against the credit database, I get different estimates under the two CEs. Under the new CE I receive an estimate of 900 rows, which I was expecting; under the legacy CE I receive an estimate of 241.416, and I can't figure out how that estimate is derived. Can anyone shed any light?

-- New CE (Estimate = 900)
DECLARE @LastName VARCHAR(15) = 'BA%'
SELECT * FROM [Credit].[dbo].[member]
WHERE [lastname] LIKE @LastName;

-- Forcing Legacy CE (Estimate = 241.416)
DECLARE @LastName VARCHAR(15) = 'BA%'
SELECT * FROM [Credit].[dbo].[member]
WHERE [lastname] LIKE @LastName
OPTION (
QUERYTRACEON 9481,
QUERYTRACEON 9292,
QUERYTRACEON 9204,
QUERYTRACEON 3604
);

In my scenario, the credit database is already at compatibility level 120, which is why the second query uses trace flags to force the legacy CE and to report which statistics are used/considered by the query optimiser. I can see that the column statistics on 'lastname' are being used, but I still can't work out how the estimate of 241.416 is derived.

I couldn't find anything online other than this Itzik Ben-Gan article, which states: "When using the LIKE predicate in all optimize for unknown scenarios both the legacy and new CEs use a 9 percent estimate." The information in that post would appear to be incorrect.

Best Answer

The guess for LIKE in your case is based on:

  • G: The standard 9% guess (sqllang!x_Selectivity_Like)
  • M: A factor of 6 (magic number)
  • D: Average data length in bytes (from statistics), rounded down to integer

Specifically, sqllang!CCardUtilSQL7::ProbLikeGuess uses:

Selectivity (S) = G / M * LOG(D)

Notes:

  • The LOG(D) term is omitted if D is between 1 and 2.
  • If D is less than 1 (including for missing or NULL statistics):
    D = FLOOR(0.5 * maximum column byte length)

This sort of quirkiness and complexity is quite typical of the original CE.

In the question example, the average length is 5 (5.6154 from DBCC SHOW_STATISTICS rounded down):

Estimate = 10,000 * (0.09 / 6 * LOG(5)) = 241.416
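
To check the arithmetic (T-SQL's LOG() is the natural logarithm, which is what the formula uses), the whole calculation fits on one line:

-- Reproduces the legacy CE estimate for the question's table:
-- 10,000 rows, average data length 5.6154 rounded down to 5
SELECT 10000e0 * (0.09e0 / 6e0) * LOG(5e0);  -- ≈ 241.416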

Other example values:

 D  = Estimate using formula for S
 15 = 406.208
 14 = 395.859
 13 = 384.742
 12 = 372.736
 11 = 359.684
 10 = 345.388
 09 = 329.584
 08 = 311.916
 07 = 291.887
 06 = 268.764
 05 = 241.416
 04 = 207.944
 03 = 164.792
 02 = 150.000 (LOG not used)
 01 = 150.000 (LOG not used)
 00 = 291.887 (LOG 7) /* FLOOR(0.5 * 15) [15 since lastname is varchar(15)] */
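
For completeness, here is a small T-SQL sketch, my own restatement of the rules above rather than the optimiser's actual code, that reproduces that table for a 10,000-row table with a varchar(15) column:

-- Sketch only: applies the guess rules described above
-- for average data lengths D from 0 to 15
DECLARE @G float = 0.09,      -- standard 9% guess
        @M float = 6,         -- magic factor
        @MaxLen float = 15,   -- maximum column byte length (varchar(15))
        @Rows float = 10000;  -- table cardinality

SELECT
    D = N.D,
    Estimate = @Rows *
        CASE
            -- D < 1 (e.g. missing or NULL statistics):
            -- fall back to half the maximum column byte length
            WHEN N.D < 1 THEN @G / @M * LOG(FLOOR(0.5e0 * @MaxLen))
            -- LOG(D) term omitted when D is 1 or 2
            WHEN N.D <= 2 THEN @G / @M
            ELSE @G / @M * LOG(N.D)
        END
FROM (VALUES (0e0),(1e0),(2e0),(3e0),(4e0),(5e0),(6e0),(7e0),
             (8e0),(9e0),(10e0),(11e0),(12e0),(13e0),(14e0),(15e0)) AS N (D)
ORDER BY N.D;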

Test rig

DECLARE
    @CharLength integer = 5, -- Set length here
    @Counter integer = 1;

CREATE TABLE #T (c1 varchar(15) NULL);

-- Add 10,000 rows
SET NOCOUNT ON;
SET STATISTICS XML OFF;

BEGIN TRANSACTION;
WHILE @Counter <= 10000
BEGIN
    INSERT #T (c1) VALUES (REPLICATE('X', @CharLength));
    SET @Counter = @Counter + 1;
END;
COMMIT TRANSACTION;

SET NOCOUNT OFF;
SET STATISTICS XML ON;

-- Test query: @Like is intentionally left unset (NULL);
-- only the cardinality estimate in the plan matters, not the result set
DECLARE @Like varchar(15);
SELECT * FROM #T AS T 
WHERE T.c1 LIKE @Like;

DROP TABLE #T;
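
Note that on a database at compatibility level 120 (as in the question) the test query above will use the new CE and simply estimate 900 rows (9% of 10,000). To see the legacy-CE figures from the table, force the old estimator the same way the question does, by changing the test query to something like:

DECLARE @Like varchar(15);
SELECT * FROM #T AS T
WHERE T.c1 LIKE @Like
OPTION (QUERYTRACEON 9481); -- force the legacy (pre-2014) CE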