SQL Server – Why is an additional filtered statistic being ignored (EAV schema)

cardinality-estimates, eav, sql server, statistics

I'm trying to improve a row estimate for this sub-query (of a larger query). The estimate is showing 1266 rows. The actual is 117k rows. This particular property (EAV schema) only has two values defined for it (2 and 3):

declare @pPropVal smallint = 2;

select Value, ObjectId 
  from Oav.ValueArray PropName
 where PropName.PropertyId = 897
   and PropName.Value  = @pPropVal
option (recompile)

The query plan shows the proper seek predicate on index IX_ValueArray_PropValObj on PropertyId and Value as expected.

(A) As an attempt to improve row estimates, an additional statistic was added which brought the row estimate up slightly to 3041:

create statistics [ST_SomePropertyName] ON [Oav].[ValueArray](PropertyId, Value, ObjectId)
 where 
     (     
             PropertyId = 897 
         and [Value] is not null
     )
  with fullscan

The histogram shows a single row. The RANGE_HI_KEY is just the PropertyId (the first column), which is not that useful, so as I understand it the estimator is falling back on the density information.

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
897             0           196026   0                    1

All density Average Length  Columns
1           4               PropertyId
0.5         8               PropertyId, Value

Name                 Updated              Rows    Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression                             Unfiltered Rows
ST_SomePropertyName  May 20 2014  2:01PM  196026  196026        1      0        8                   NO            ([PropertyId]=(897) AND [Value] IS NOT NULL)  9317055

(B) Since there is a filter on PropertyId = 897, I thought I could re-create the statistic like this:

create statistics [ST_SomePropertyName] ON [Oav].[ValueArray](Value, ObjectId)
where
    (       
       PropertyId = 897 
       and [Value] is not null
    )
 with fullscan

The histogram looks useful to my eyes but the estimator appears to be ignoring it because it reverts to the original estimate of 1266.

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
2             0           117760   0                     1
3             0           78266    0                     1

All density   Average Length  Columns
0.5           4               Value
5.101364E-06  12              Value, ObjectId

Name                 Updated              Rows    Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression                             Unfiltered Rows
ST_SomePropertyName  May 20 2014  2:04PM  196026  196026        2      0        12                  NO            ([PropertyId]=(897) AND [Value] IS NOT NULL)    9317055

(C) Filtering the statistic to a fixed value does work (and it does not even need the second and third columns), but that is not a very practical solution. This gave the exact estimate of 117k.

create statistics [ST_SomePropertyName] ON [Oav].[ValueArray](PropertyId)
 where 
     (     
             PropertyId = 897 
         and [Value] = 2
     )
  with fullscan

histogram:

RANGE_HI_KEY   RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
897            0           117760   0                    1

(D) (Added to original question) Limiting the value to a smaller range helps. But if the values are not uniformly distributed across the range, or the value is a string-based field, or the range is not even known, this may not be a good workaround in general:

CREATE STATISTICS [ST_ListUnderBrand_897] ON [Oav].[ValueArray](PropertyId, Value)
WHERE 
  (       
      PropertyId = 897 
      and [Value] >= 1 and [Value] <= 20
  )
  with fullscan

This gives estimates of about 16k. Changing the [1,20] to the exact [2,3] gives estimates of ~80k. It seems obvious that the true range of Values in the table data is not really used (since Value is the second column) and that this is an estimate based mostly on the filter range.

Please note the Value field is a sql_variant but I don't think that is related as the query plan does not show any implicit conversions.

Why doesn't SQL Server use the statistics from B? Should it?

Are there other options available to fix this?

Best Answer

Filtered indexes and statistics won't come into play when you're using local variables, unless you use the OPTION (RECOMPILE) query hint, and are running SQL Server 2008 R2 or later.

Tim Chapman's MSDN blog post explains with examples.
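As a minimal sketch of the point made in this answer, reusing the table and predicate from the question: the first statement is compiled without knowing the variable's value, while the second is recompiled with the current value available. Whether a given filtered statistic is then actually used still has to be confirmed from the estimated row count in the plan.

```sql
DECLARE @pPropVal smallint = 2;

-- Without the hint: the value of @pPropVal is unknown when the plan is compiled
SELECT Value, ObjectId
FROM Oav.ValueArray
WHERE PropertyId = 897
  AND Value = @pPropVal;

-- With the hint: the statement is compiled using the variable's current value
SELECT Value, ObjectId
FROM Oav.ValueArray
WHERE PropertyId = 897
  AND Value = @pPropVal
OPTION (RECOMPILE);
```

Comparing the estimated rows of the two statements in the actual execution plan shows whether the hint changes the estimate on your build.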

Related Solutions

SQL Server – Estimated vs. Actual rows and multi-column statistics

You've actually got a few questions in here, so I'll break 'em out individually.

My problem/question is that the estimated number of rows for the above query is 256, the actual number of rows is 560K. I want to understand why there is such a big difference between these two numbers?

In order to answer that question, the first thing we would need is the actual execution plan for the query. In SQL Server Management Studio, you can get that by clicking Query, Include Actual Execution Plan. Run the query, click on the Execution Plan tab, and right-click anywhere in the whitespace to click Save Plan As. Save that, and post it somewhere for people to download and examine.

The next thing we would need is the output from DBCC SHOW_STATISTICS for the stats on that table. You've hinted at the output, and that's a good start, but the raw output will help us understand exactly what's going on.

If I run a DBCC SHOW_Statistics, the density section has both columns in it, the histogram does not. Does SQL server produce a histogram for the combination of columns in multi-column statistics?

No.
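A small sketch that makes this visible. The table dbo.StatsDemo and the statistic ST_StatsDemo_A_B are hypothetical names invented for the illustration: the density vector lists every column prefix, while the histogram steps are built on the leading column only.

```sql
-- Hypothetical demo table and statistic, not from the original question
CREATE TABLE dbo.StatsDemo (A int NOT NULL, B int NOT NULL);

INSERT INTO dbo.StatsDemo (A, B)
VALUES (1, 10), (1, 20), (2, 10), (2, 30);

CREATE STATISTICS ST_StatsDemo_A_B ON dbo.StatsDemo (A, B) WITH FULLSCAN;

-- Density vector: one row per column prefix, (A) and (A, B)
DBCC SHOW_STATISTICS ('dbo.StatsDemo', 'ST_StatsDemo_A_B') WITH DENSITY_VECTOR;

-- Histogram: RANGE_HI_KEY values come from the leading column A only
DBCC SHOW_STATISTICS ('dbo.StatsDemo', 'ST_StatsDemo_A_B') WITH HISTOGRAM;

DROP TABLE dbo.StatsDemo;
```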

I have an index on (TaskExecStatusID,TaskExecUpdatedDate),

If you frequently use the query in the example (with TaskExecUpdatedDate IS NULL), then you might check out filtered indexes. They were introduced in SQL Server 2008 and basically allow you to put a where clause on your index.

http://sqlfool.com/2009/04/filtered-indexes-what-you-need-to-know/
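A filtered index for that case might look like the sketch below. The table name dbo.TaskExec and the index name are assumptions, since the question does not show them; the column names come from the index definition quoted above.

```sql
-- dbo.TaskExec and the index name are hypothetical; adjust to the real table.
-- Only rows with a NULL TaskExecUpdatedDate are stored in this index,
-- and its statistics describe just that subset of rows.
CREATE NONCLUSTERED INDEX IX_TaskExec_Status_NullUpdatedDate
ON dbo.TaskExec (TaskExecStatusID)
WHERE TaskExecUpdatedDate IS NULL;
```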

SQL Server – How to Improve Row Estimate for Newly Inserted Data in Join

Q1) Why mathematically is the original estimate so bad? I mean the CacheId's are sparse but not at a ratio of 20000:1.

Here is the rule that triggers an automatic statistics update, from the KB article "Statistical maintenance functionality (autostats) in SQL Server":

The above algorithm can be summarised in the form of a table:

Table Type | Empty Condition | Threshold When Empty | Threshold When Not Empty
Permanent  | < 500 rows      | # of Changes >= 500  | # of Changes >= 500 + (20% of Cardinality)

Even though the KB article refers to SQL Server 2000, the rule still holds up to SQL Server 2012.

Run through this scenario and see for yourself.

STEP#1

SET STATISTICS IO OFF;
GO
SET NOCOUNT ON;
GO
-- make sure the Include Actual Execution Plan is off!!!
IF OBJECT_ID('IDs') IS NOT NULL
DROP TABLE dbo.IDs;

CREATE TABLE dbo.IDs
(
ID tinyint NOT NULL
);

INSERT INTO IDs
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7;

IF OBJECT_ID('TestStats') IS NOT NULL
DROP TABLE dbo.TestStats;

CREATE TABLE dbo.TestStats
(
 ID tinyint NOT NULL,
 Col1 int NOT NULL,
 CONSTRAINT PK_TestStats PRIMARY KEY CLUSTERED (ID, col1)
);

DECLARE @id int = 1;
DECLARE @i int = 1;

-- load IDs 1 through 6 with 20247 rows each
WHILE @id <= 6
BEGIN
    SET @i = 1;

    WHILE @i <= 20247
    BEGIN
        INSERT INTO dbo.TestStats VALUES(@id, @i);

        SET @i = @i + 1;
    END

    SET @id = @id + 1;
END

-- so far so good!
SELECT ID, COUNT(*) AS RowCnt FROM dbo.TestStats GROUP BY ID;

DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;

Now we have a table with IDs 1 through 6 and each ID has 20247 rows. Stats look good so far!

STEP#2

-- now insert another ID = 7 with 20247 rows
DECLARE @i int = 1;

WHILE @i <= 20247
BEGIN
  INSERT INTO dbo.TestStats VALUES(7,@i);

  SET @i = @i + 1
END

-- see the problem with the histogram?
SELECT ID, COUNT(*) FROM dbo.TestStats GROUP BY ID;

DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;

Look at the table and the histogram! The table has 20247 rows for ID = 7, but the histogram knows nothing about the newly inserted data because the auto update didn't trigger. According to the formula, you need to insert (20247 * 6) * 0.2 + 500 = 24,796.4 rows to trigger an auto update of the stats on this table.

Thus, if you look at the plans for these queries you see the wrong estimates:

-- CTRL + M to include the Actual Execution plan
-- now, if we run these queries, the optimizer has no info about ID = 7
-- and estimates 1 row because it cannot estimate 0.
SELECT ts.*
FROM dbo.TestStats ts
INNER JOIN dbo.IDs ON IDs.ID = ts.ID
WHERE IDs.ID = 1;

SELECT ts.*
FROM dbo.TestStats ts
INNER JOIN dbo.IDs ON IDs.ID = ts.ID
WHERE IDs.ID = 7;

Query #1 and Query #2: [execution plan screenshots not preserved]

The optimizer cannot estimate 0 rows, so it shows 1.

STEP#3

-- now we manually update the stats
UPDATE STATISTICS dbo.TestStats WITH FULLSCAN;

-- check the histogram
DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;

-- rerun the queries
SELECT ts.*
FROM dbo.TestStats ts
INNER JOIN dbo.IDs ON IDs.ID = ts.ID
WHERE IDs.ID = 1;

SELECT ts.*
FROM dbo.TestStats ts
INNER JOIN dbo.IDs ON IDs.ID = ts.ID
WHERE IDs.ID = 7;

Now the histogram shows the missing ID 7 and the execution plans show the right estimates as well.

Query #1 and Query #2: [execution plan screenshots not preserved]

Q2) As the number of cacheId's increases would you expect the estimates for newly inserted data improve naturally?

Yes, as soon as the number of changes passes the threshold of 500 + 20% of the total rows, the auto update will trigger. You can run through this scenario by re-running STEP#1 and then modifying STEP#2 to run these queries:

-- now insert another ID = 7 with 20247 rows
DECLARE @i int = 1;

WHILE @i <= 20247
BEGIN
   INSERT INTO dbo.TestStats VALUES(7,@i);

   SET @i = @i + 1
END

-- see the problem with the histogram?
SELECT ID, COUNT(*) FROM dbo.TestStats GROUP BY ID;

DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;
GO
-- try to insert ID = 8 to trigger the auto update for the stats
DECLARE @i int = 1;

WHILE @i <= 4548
BEGIN
  INSERT INTO dbo.TestStats VALUES(8,@i);

  SET @i = @i + 1
END

-- no update yet
SELECT ID, COUNT(*) FROM dbo.TestStats GROUP BY ID;

DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;

No update yet because the threshold is 24,796.4 - 20247 = 4549.4 but we inserted only 4548 rows for ID 8. Now insert this one row and double check the histogram:

-- this will trigger the update
INSERT INTO dbo.TestStats VALUES(8,4549);

-- double check
SELECT ID, COUNT(*) FROM dbo.TestStats GROUP BY ID;

DBCC SHOW_STATISTICS('TestStats',PK_TestStats) WITH HISTOGRAM;
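The threshold arithmetic used in this walkthrough can be checked directly. This is only a sketch of the 500 + 20% rule as quoted above; the variable and column names are made up for the illustration.

```sql
-- Rows in dbo.TestStats when the statistics were last built (IDs 1 through 6)
DECLARE @rows_at_last_update int = 20247 * 6;                            -- 121482

-- Changes needed to trigger an auto update: 500 + 20% of the cardinality
DECLARE @threshold decimal(12, 1) = 500 + 0.20 * @rows_at_last_update;   -- 24796.4

SELECT @rows_at_last_update AS RowsAtLastUpdate,
       @threshold           AS ChangesNeeded,
       @threshold - 20247   AS StillNeededAfterId7;                      -- 4549.4
```

The last column is the number of further changes required once the 20247 rows for ID = 7 have been inserted, which is why 4548 rows for ID = 8 are not enough.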

Q3) Are there any ways (gulp, tricks or otherwise) to improve the estimate (or make it less certain of 1 row) without having to update the statistics every time a new set of data is inserted (e.g. adding a fake data set at a much larger CacheId = 999999).

From the KB article "Controlling Autostat (AUTO_UPDATE_STATISTICS) behavior in SQL Server":

However, when a table becomes very large, the old threshold (a fixed rate – 20% of rows changed) may be too high and the Autostat process may not be triggered frequently enough. This could lead to potential performance problems. SQL Server 2008 R2 Service Pack 1 and later versions introduce trace flag 2371 that you can enable to change this default behavior. The higher the number of rows in a table, the lower the threshold will become to trigger an update of the statistics. For example, if the trace flag is activated, update statistics will be triggered on a table with 1 billion rows when 1 million changes occur. If the trace flag is not activated, then the same table with 1 billion records would need 200 million changes before an update statistics is triggered.
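A sketch of how the trace flag could be enabled and how the accumulated changes could be watched on the demo table. It assumes sysadmin rights and a build that ships sys.dm_db_stats_properties (SQL Server 2008 R2 SP2 or 2012 SP1 and later).

```sql
-- Enable trace flag 2371 globally; this does not survive a restart
-- (use the -T2371 startup parameter to make it permanent)
DBCC TRACEON (2371, -1);

-- List the trace flags that are currently active
DBCC TRACESTATUS (-1);

-- Rows at the last statistics update and modifications accumulated since then
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.TestStats');
```

Comparing modification_counter with the threshold shows how close a statistic is to its next automatic update.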

Hope this helped you to understand! Pretty good question!
