SQL Server – How can row estimates be improved in order to reduce the chances of spills to tempdb?

Tags: join, optimization, sql-server

I notice that when there are spill-to-tempdb events (causing slow queries), the row estimates are often way off for a particular join. I've seen spill events occur with merge and hash joins, and they often increase the runtime 3x to 10x. This question concerns how to improve row estimates, under the assumption that doing so will reduce the chance of spill events.

The actual number of rows is 40k.

For this query, the plan shows a bad row estimate (11.3 rows):

select Value
  from Oav.ValueArray
 where ObjectId = (select convert(bigint, Value) NodeId
                     from Oav.ValueArray
                    where PropertyId = 3331  
                      and ObjectId = 3540233
                      and Sequence = 2)
   and PropertyId = 2840
option (recompile);

For this query, the plan shows good row estimate (56k rows):

declare @a bigint = (select convert(bigint, Value) NodeId
                       from Oav.ValueArray
                      where PropertyId = 3331
                        and ObjectId = 3540233
                        and Sequence = 2);

select Value
  from Oav.ValueArray
 where ObjectId = @a               
   and PropertyId = 2840
option (recompile);

Can statistics or hints be added to improve the row estimates for the first case? I tried adding statistics with particular filter values (PropertyId = 2840), but either I could not get the combination correct, or perhaps they are being ignored because the ObjectId is unknown at compile time and the optimizer may be choosing an average over all ObjectIds.
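
For reference, the filtered statistics I attempted were roughly of this shape (the statistics name and column choice here are illustrative, not the exact statement I ran):

create statistics ValueArray_Property2840_Stats
    on Oav.ValueArray (ObjectId)
    where PropertyId = 2840;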

Is there any mode where it would run the probe query first and then use the result to determine the row estimates, or must it fly blindly?

This particular property has many values (40k) on a few objects and zero on the vast majority. I would be happy with a hint where the maximum expected number of rows for a given join could be specified. This is a generally haunting problem because some parameters may be determined dynamically as part of the join, or would be better placed within a view (which does not support variables).
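
The closest workaround I know of is OPTIMIZE FOR, which only works with the variable form: it pins the estimate to a chosen value's histogram entry without needing a recompile. A sketch, where the literal 1234567 is a stand-in for a known hot ObjectId:

declare @a bigint = 1234567;  -- in practice, the result of the probe query

select Value
  from Oav.ValueArray
 where ObjectId = @a
   and PropertyId = 2840
option (optimize for (@a = 1234567));  -- estimate as if @a were this hot value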

Are there any parameters that can be adjusted to minimize the chance of spills to tempdb (e.g. min memory per query)? OPTION (ROBUST PLAN) had no effect on the estimate.
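
For completeness, that instance-wide setting is adjusted through sp_configure; note it affects every query on the instance, so this is a sketch rather than a recommendation:

exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'min memory per query (KB)', 2048;  -- default is 1024
reconfigure;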

Edit 2013.11.06: Response to comments and additional information:

Here are the query plan images. The warnings are about the cardinality/seek predicate with the convert():

[query plan images]

Per @Aaron Bertrand's comment, I tried replacing the convert() as a test:

create table Oav.SeekObject (
       LookupId bigint not null primary key,
       ObjectId bigint not null
);

insert into Oav.SeekObject (
   LookupId, ObjectId
) VALUES (
   1, 3540233
);

select Value
  from Oav.ValueArray
 where ObjectId = (select ObjectId 
                     from Oav.SeekObject 
                    where LookupId = 1)
   and PropertyId = 2840
option (recompile);

[query plan image]

As an odd but successful point of interest, I also allowed it to short-circuit the lookup:

select Value
  from Oav.ValueArray
 where ObjectId = (select ObjectId 
                     from Oav.ValueArray
                    where PropertyId = 2840
                      and ObjectId = 3540233
                      and Sequence = 2)
   and PropertyId = 2840
option (recompile);

[query plan image]

Both of these list a proper key lookup, but only the first one lists an "Output" of ObjectId. I guess that indicates the second is indeed short-circuiting?

Can someone verify whether single-row probes are ever performed to help with row estimates? It seems wrong to limit optimization to histogram estimates alone when a single-row PK lookup can greatly improve the accuracy of the lookup into the histogram (especially if there is spill potential or history). When there are 10 of these sub-joins in a real query, ideally they would happen in parallel.

As a side note, since sql_variant stores its base type (SQL_VARIANT_PROPERTY = BaseType) within the field itself, I would expect a convert() to be nearly costless as long as it is "directly" convertible (e.g. not string to decimal, but int to int or maybe int to bigint). Since that is not known at compile time but may be known by the user, perhaps an "AssumeType(type, …)" function for sql_variants would allow them to be treated more transparently.
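
For example, the stored base type can be read straight out of a value:

declare @v sql_variant = convert(bigint, 3540233);
select sql_variant_property(@v, 'BaseType') as BaseType;  -- returns 'bigint'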

Best Answer

I won't comment on spills, tempdb or hints, because the query seems too simple to need that much consideration. I think SQL Server's optimizer will do its job quite well if there are indexes suited to the query.

And your splitting it into two queries is good, as it shows which indexes will be useful. The first part:

(select convert(bigint, Value) NodeId
 from Oav.ValueArray
 where PropertyId = 3331  
   and ObjectId = 3540233
   and Sequence = 2)

needs an index on (PropertyId, ObjectId, Sequence) including the Value. I'd make it UNIQUE to be safe. The query would throw an error at runtime anyway if more than one row were returned, so it's good to ensure in advance that this won't happen, with a unique index:

CREATE UNIQUE INDEX
    PropertyId_ObjectId_Sequence_UQ
  ON Oav.ValueArray
    (PropertyId, ObjectId, Sequence) INCLUDE (Value) ;

The second part of the query:

select Value
  from Oav.ValueArray
 where ObjectId = @a               
   and PropertyId = 2840

needs an index on (PropertyId, ObjectId) including the Value:

CREATE INDEX
    PropertyId_ObjectId_IX
  ON Oav.ValueArray
    (PropertyId, ObjectId) INCLUDE (Value) ;

If efficiency is not improved, or these indexes are not used, or there are still differences in row estimates, then the query will need further investigation.
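
A quick way to check whether the new indexes are actually being used is the index usage DMV (a sketch; adjust the names to your database):

SELECT i.name, s.user_seeks, s.user_scans, s.user_lookups
  FROM sys.indexes AS i
  LEFT JOIN sys.dm_db_index_usage_stats AS s
    ON  s.object_id = i.object_id
    AND s.index_id = i.index_id
    AND s.database_id = DB_ID()
 WHERE i.object_id = OBJECT_ID(N'Oav.ValueArray') ;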

In that case, the conversions (required by the EAV design and the storing of different datatypes in the same column) are a probable cause, and your splitting of the query into two parts (as @Aaron Bertrand and @Paul White commented) seems natural and the way to go. A redesign to store the different datatypes in their own columns might be another.
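
A minimal sketch of that redesign, with one typed column per base type instead of a single sql_variant Value (column names and types here are purely illustrative):

CREATE TABLE Oav.ValueArrayTyped (
    ObjectId    bigint         NOT NULL,
    PropertyId  int            NOT NULL,
    Sequence    int            NOT NULL,
    ValueBigInt bigint             NULL,  -- populated when the property is integral
    ValueString nvarchar(4000)     NULL,  -- populated when the property is textual
    CONSTRAINT ValueArrayTyped_PK
        PRIMARY KEY (ObjectId, PropertyId, Sequence)
) ;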