Sql-server – Inclusion of ORDER BY on query that returns no rows drastically affects performance

performancesql server

Given a simple three table join, query performance changes drastically when ORDER BY is included even with no rows returned. Actual problem scenario take 30 seconds to return zero rows but is instant when ORDER BY not included. Why?

SELECT * 
FROM tinytable t                          /* one narrow row */
JOIN smalltable s on t.id=s.tinyId        /* one narrow row */
JOIN bigtable b on b.smallGuidId=s.GuidId /* a million narrow rows */
WHERE t.foreignId=3                       /* doesn't match */
ORDER BY b.CreatedUtc          /* try with and without this ORDER BY */

I understand that I could have an index on bigtable.smallGuidId, but, I believe that would actually make it worse in this case.

Here's script to create/populate the tables for test. Curiously, it seems to matter that smalltable has an nvarchar(max) field. It also seems to matter that I'm joining on the bigtable with a guid (which I guess makes it want to use hash matching).

CREATE TABLE tinytable
  (
     id        INT PRIMARY KEY IDENTITY(1, 1),
     foreignId INT NOT NULL
  )

CREATE TABLE smalltable
  (
     id     INT PRIMARY KEY IDENTITY(1, 1),
     GuidId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
     tinyId INT NOT NULL,
     Magic  NVARCHAR(max) NOT NULL DEFAULT ''
  )

CREATE TABLE bigtable
  (
     id          INT PRIMARY KEY IDENTITY(1, 1),
     CreatedUtc  DATETIME NOT NULL DEFAULT GETUTCDATE(),
     smallGuidId UNIQUEIDENTIFIER NOT NULL
  )

INSERT tinytable
       (foreignId)
VALUES(7)

INSERT smalltable
       (tinyId)
VALUES(1)

-- make a million rows 
DECLARE @i INT;

SET @i=20;

INSERT bigtable
       (smallGuidId)
SELECT GuidId
FROM   smalltable;

WHILE @i > 0
  BEGIN
      INSERT bigtable
             (smallGuidId)
      SELECT smallGuidId
      FROM   bigtable;

      SET @i=@i - 1;
  END 

I've tested on SQL 2005, 2008 and 2008R2 with same results.

Best Answer

I agree with Martin Smith's answer, but the problem is not simply one of statistics, exactly. The statistics for the foreignId column (assuming automatic statistics are enabled) accurately show that no rows exist for a value of 3 (there's just one, with a value of 7):

DBCC SHOW_STATISTICS (tinytable, foreignId) WITH HISTOGRAM

statistics output

SQL Server knows that things might have changed since the statistics were captured, so there might be a row for value 3 when the plan is executed. In addition, any amount of time might elapse between plan compilation and execution (plans are cached for reuse, after all). As Martin says, SQL Server contains logic to detect when sufficient modifications have been made to justify recompiling any cached plan for optimality reasons.

None of this ultimately matters, however. With one edge-case exception, the optimizer will never estimate the number of rows produced by a table operation to be zero. If it can statically determine that the output must always be zero rows, the operation is redundant and will be removed completely.

The optimizer's model instead estimates a minimum of one row. Employing this heuristic tends to produce better plans on average than would be the case if a lower estimate was possible. A plan that produces a zero-row estimate at some stage would be useless from that point on in the processing stream, since there would be no basis to make cost-based decisions (zero rows is zero rows no matter what). If the estimate turns out to be wrong, the plan shape above the zero row estimate stands almost no chance of being reasonable.

The second factor is another modelling assumption called the Containment Assumption. This essentially says that if a query joins a range of values with another range of values, it is because the ranges overlap. Another way to put this is to say that the join is being specified because rows are expected to be returned. Without this reasoning, costs would be generally underestimated, resulting in poor plans for a broad range of common queries.

Essentially, what you have here is a query that doesn't fit the optimizer's model. There's nothing we can do to 'improve' estimates with multi-column or filtered indexes; there's no way to get an estimate lower than 1 row here. A real database might have foreign keys to ensure that this situation could not arise, but assuming that is not applicable here, we are left with using hints to correct the out-of-model condition. Any number of different hint approaches will work with this query. OPTION (FORCE ORDER) is one that happens to work well with the query as written.