Sql-server – SQL Server unpredictable select results (dbms error?)

sql-server-2008sql-server-2012

Below is simple example, which returns strange results, that are unpredictable and we cannot explain it in our team. Are we doing something wrong or is it SQL Server error?

After some investigation we reduced the search area to union clause in subquery, which selects one record from "men" table

It works as expected in SQL Server 2000 (returns 12 rows), but in 2008 and 2012 it returns only one row.

create table dual (dummy int)

insert into dual values (0)

create table men (
man_id int,
wife_id int )

-- there are 12 men, 6 married 
insert into men values (1, 1)
insert into men values (2, 2)
insert into men values (3, null)
insert into men values (4, null)
insert into men values (5, null)
insert into men values (6, 3)
insert into men values (7, 5)
insert into men values (8, 7)
insert into men values (9, null)
insert into men values (10, null)
insert into men values (11, null)
insert into men values (12, 9)

This returns only one row: 1 1 2

select 
man_id,
wife_id,
(select count( * ) from 
    (select dummy from dual
     union select men.wife_id  ) family_members
) as family_size
from men
--where wife_id = 2 -- uncomment me and try again

Uncomment last line and it gives: 2 2 2

There is a lot of odd behaviors:

  • After series of drops, creates, truncates and inserts on "men" table it sometimes works (returns 12 rows)
  • When you change "union select men.wife_id" to "union all select men.wife_id" or "union select isnull(men.wife_id, null)" (!!!) it returns 12 rows (as expected).
  • The strange behavior seems to be unrelated to datatype of column "wife_id". We observed it on development system with much greater data sets.
  • "where wife_id > 0" returns 6 rows
  • we also observes strange behavior of views with this kind of statements. SELECT * returns subset of rows, SELECT TOP 1000 returns all

Best Answer

Are we doing something wrong or is it SQL Server error?

It is a wrong-results bug, which you should report via your usual support channel. If you do not have a support agreement, it may help to know that paid incidents are normally refunded if Microsoft confirms the behaviour as a bug.

The bug requires three ingredients:

  1. Nested loops with an outer reference (an apply)
  2. An inner-side lazy index spool that seeks on the outer reference
  3. An inner-side Concatenation operator

For example, the query in the question produces a plan like the following:

Annotated plan

There are many ways to remove one of these elements, so the bug no longer reproduces.

For example, one could create indexes or statistics that happen to mean the optimizer chooses not to utilize a Lazy Index Spool. Or, one could use hints to force a hash or merge union instead of using Concatenation. One could also rewrite the query to express the same semantics, but which results in a different plan shape where one or more of the required elements are missing.

More details

A Lazy Index Spool lazily caches inner side result rows, in a work table indexed by outer reference (correlated parameter) values. If a Lazy Index Spool is asked for an outer reference it has seen before, it fetches the cached result row from its work table (a "rewind"). If the spool is asked for an outer reference value it has not seen before, it runs its subtree with the current outer reference value and caches the result (a "rebind"). The seek predicate on the Lazy Index Spool indicates the key(s) for its work table.

The problem occurs in this specific plan shape when the spool checks to see if a new outer reference is the same as one it has seen before. The Nested Loops Join updates its outer references correctly, and notifies operators on its inner input via their PrepRecompute interface methods. At the start of this check, inner side operators read the CParamBounds:FNeedToReload property to see if the outer reference has changed from last time. An example stack trace is shown below:

CParamBounds:FNeedToReload

When the subtree shown above exists, specifically where Concatenation is used, something goes wrong (perhaps a ByVal/ByRef/Copy problem) with the bindings such that CParamBounds:FNeedToReload always returns false, regardless of whether the outer reference actually changed or not.

When the same subtree exists, but a Merge Union or Hash Union is used, this essential property is set correctly on each iteration, and the Lazy Index Spool rewinds or rebinds each time as appropriate. The Distinct Sort and Stream Aggregate are blameless, by the way. My suspicion is that Merge and Hash Union make a copy of the previous value, whereas Concatenation uses a reference. It is just about impossible to verify this without access to the SQL Server source code, unfortunately.

The net result is that the Lazy Index Spool in the problematic plan shape always thinks it has already seen the current outer reference, rewinds by seeking into its work table, generally finds nothing, so no row is returned for that outer reference. Stepping through the execution in a debugger, the spool only ever executes its RewindHelper method, and never its ReloadHelper method (reload = rebind in this context). This is evident in the execution plan because operators under the spool all have 'Number of Executions = 1'.

RewindHelper

The exception, of course, is for the first outer reference the Lazy Index Spool is given. This always executes the subtree and caches a result row in the work table. All subsequent iterations result in a rewind, which will only produce a row (the single cached row) when the current iteration has the same value for the outer reference as the first time around.

So, for any given input set on the outer side of the Nested Loops Join, the query will return as many rows as there are duplicates of the first row processed (plus one for the first row itself of course).

Demo

Table and sample data:

CREATE TABLE #T1 
(
    pk integer IDENTITY NOT NULL,
    c1 integer NOT NULL,

    CONSTRAINT PK_T1
    PRIMARY KEY CLUSTERED (pk)
);
GO
INSERT #T1 (c1)
VALUES
    (1), (2), (3), (4), (5), (6),
    (1), (2), (3), (4), (5), (6),
    (1), (2), (3), (4), (5), (6);

The following (trivial) query produces a correct count of two for each row (18 in total) using a Merge Union:

SELECT T1.c1, C.c1
FROM #T1 AS T1
CROSS APPLY 
(
    SELECT COUNT_BIG(*) AS c1
    FROM
    (
        SELECT T1.c1
        UNION
        SELECT NULL
    ) AS U
) AS C;

Merge Union Plan

If we now add a query hint to force a Concatenation:

SELECT T1.c1, C.c1
FROM #T1 AS T1
CROSS APPLY 
(
    SELECT COUNT_BIG(*) AS c1
    FROM
    (
        SELECT T1.c1
        UNION
        SELECT NULL
    ) AS U
) AS C
OPTION (CONCAT UNION);

The execution plan has the problematic shape:

Concatenation Plan

And the result is now incorrect, just three rows:

Three row result

Though this behaviour is not guaranteed, the first row from the Clustered Index Scan has a c1 value of 1. There are two other rows with this value, so three rows are produced in total.

Now truncate the data table and load it with more duplicates of the 'first' row:

TRUNCATE TABLE #T1;

INSERT #T1 (c1)
VALUES
    (1), (2), (3), (4), (5), (6),
    (1), (2), (3), (4), (5), (6),
    (1), (1), (1), (1), (1), (1);

Now the Concatenation plan is:

8 row Concatenation Plan

And, as indicated, 8 rows are produced, all with c1 = 1 of course:

8 row result

I notice you have opened a Connect item for this bug but really that is not the place to report issues that are having a production impact. If that is the case, you really ought to contact Microsoft Support.


This wrong-results bug was fixed at some stage. It no longer reproduces for me on any version of SQL Server from 2012 onward. It does repro on SQL Server 2008 R2 SP3-GDR build 10.50.6560.0 (X64).

Related Question