SQL Server – How to effectively handle very skewed data? Statistics up to date but don't seem to help

cardinality-estimates, performance, query-performance, sql-server, sql-server-2012, statistics

I have a table, dbo.ClaimBilling, with 130,000 rows. In this table, the column OperatorID is a varchar(max) and is heavily skewed: 125,000 rows are 'user1', and the remaining 5,000 rows are split among six other values, with 'user2' having only 3 rows.

There is a non-clustered index on OperatorID, and the clustered index is the primary key, IDClaimBilling.
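For reference, a minimal sketch of the schema as described (the column definitions are assumptions; note that SQL Server does not allow varchar(max) as an index key column, so the sketch narrows OperatorID to varchar(50) purely so the index can be created):

CREATE TABLE dbo.ClaimBilling
(
    -- Clustered primary key, as described in the question
    IDClaimBilling int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_ClaimBilling PRIMARY KEY CLUSTERED,
    -- Heavily skewed column; varchar(50) here because varchar(max)
    -- cannot be an index key column
    OperatorID varchar(50) NOT NULL
);

-- Non-clustered index on the skewed column
CREATE NONCLUSTERED INDEX IX_ClaimBilling_OperatorID
    ON dbo.ClaimBilling (OperatorID);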

I have the following query currently:

SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = @operator;

No matter what value @operator holds, the estimated number of rows from ClaimBilling is ~4,000, which is not close to what any value actually returns, and the plan is always a clustered index scan; it never uses the OperatorID index. If I remove the join and run

SELECT DISTINCT IDClaimBilling
FROM dbo.ClaimBilling
WHERE OperatorID = @operator

then it does use the OperatorID index, but the estimates are still wrong regardless of the value of @operator, this time always around 18,000 rows.

I did an UPDATE STATISTICS dbo.ClaimBilling WITH FULLSCAN before running the queries.
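To confirm what the optimizer can see, the statistics histogram can be inspected directly (the index name below is a placeholder; substitute the actual one):

UPDATE STATISTICS dbo.ClaimBilling WITH FULLSCAN;

-- Shows the per-value row counts available to the optimizer;
-- after a FULLSCAN these should match the real distribution exactly
DBCC SHOW_STATISTICS ('dbo.ClaimBilling', IX_ClaimBilling_OperatorID)
    WITH HISTOGRAM;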

Why are these estimates so wrong when the statistics know exactly how many rows there are per value?

In testing, I'm declaring @operator and assigning it a value directly. The query was originally part of a stored procedure, and I assumed that was the problem, but it behaves the same way in an ad hoc statement as well.
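A minimal repro of that test, as I understand it (the declaration is assumed):

-- Even in an ad hoc batch, @operator is a local variable, so its
-- value is unknown to the optimizer at compile time
DECLARE @operator varchar(max) = 'user2';

SELECT DISTINCT IDClaimBilling
FROM dbo.ClaimBilling
WHERE OperatorID = @operator;
-- Estimate: ~18,000 rows, no matter which value is assigned above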

The query only runs when the user first logs in, so only maybe a couple times a day per user.

Best Answer

Community Wiki answer generated from comments left on the question

If you are running the query with @operator as a local variable, SQL Server cannot 'sniff' the value in the variable, so it uses the statistics' average density value to calculate an estimate: 130,000 rows × (1/7 distinct values) ≈ 18,571, which matches the ~18,000 you are seeing. No matter what value you assign to the variable, this average-density estimate will always be the same.
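The density value itself is visible in the statistics (index name assumed, as above), and multiplying it by the table cardinality reproduces the fixed estimate:

-- The "All density" value for OperatorID is
-- 1 / (number of distinct values) = 1/7 ~= 0.1429
DBCC SHOW_STATISTICS ('dbo.ClaimBilling', IX_ClaimBilling_OperatorID)
    WITH DENSITY_VECTOR;

-- Estimate = density * cardinality = (1.0 / 7) * 130000 ~= 18,571 rows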

One way to resolve this is the OPTION (RECOMPILE) query hint. It compiles a fresh plan on each execution, optimized for the specific value in the variable at that time, at the cost (usually small) of a statement-level recompilation every time the query runs.
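Applied to the original query, it would look like this:

SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = @operator
-- Recompiling per execution lets the optimizer embed the runtime
-- value of @operator, so the histogram drives the estimate instead
-- of the average density
OPTION (RECOMPILE);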

You could also modularize the code: use an IF statement to check the value of @operator, and if it's 'user1' call one stored procedure, say sp_user1; if it is not 'user1', call a different procedure. The first procedure would be optimized for 'user1' and the second for the remaining values. You could also use OPTION (RECOMPILE) inside the second procedure for the non-'user1' values if required.
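A hypothetical dispatcher following that pattern (all procedure names here are placeholders):

CREATE PROCEDURE dbo.GetClaimBillingIDs
    @operator varchar(max)
AS
BEGIN
    IF @operator = 'user1'
        EXEC dbo.sp_user1;                  -- plan optimized for the dominant value
    ELSE
        EXEC dbo.sp_other_users @operator;  -- plan optimized for the small values,
                                            -- optionally with OPTION (RECOMPILE) inside
END;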

It might also be a decent use case for dynamic SQL. That would turn your @operator variable into a literal value in the statement text, giving each user a plan customized for their value. Since there are only 7 users in that table, the extra cached plans shouldn't pose an issue for you.
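A sketch of that approach, with the value's quotes doubled before it is embedded as a literal:

DECLARE @sql nvarchar(max) = N'
SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = ''' + REPLACE(@operator, '''', '''''') + N''';';

-- Each distinct literal compiles its own plan, customized to that value
EXEC sys.sp_executesql @sql;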
