SQL Server – Join on NULL Key Column Optimizing as Table and Index Scans

Tags: execution-plan, join, null, sql-server, sql-server-2014

I have a question about this query plan.

We have a table in a test environment, Order_Details_Taxes, that has 11,225,799 rows. This table has a column, OrdTax_PLTax_LoadDtl_Key, which is NULL on every single row. This test environment is configured in such a way that this column will always be NULL. There is an index on this column.

I ran some queries that join to this table on a NULL key value. An inner join on NULL can never return any rows, because NULL never compares equal to anything, not even another NULL.
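As a quick sanity check (this snippet is just my illustration, not part of the actual workload), a NULL-to-NULL equality comparison evaluates to UNKNOWN rather than TRUE when ANSI_NULLS is on:

-- With ANSI_NULLS ON, null = null evaluates to UNKNOWN, not TRUE,
-- so the CASE expression falls through to the ELSE branch.
select case when null = null then 'match' else 'no match' end -- returns 'no match'

Here are the two queries I ran: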

declare @Keys table (KeyValue decimal(15,0))
insert into @Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

select *
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

These are the first two queries in the linked query plan. The first select scans the 11-million-row table and joins the result to @Keys. The second select starts from @Keys, but it still does a clustered index scan on this large table.

I know table variables (@Tables) are questionable in many cases, so I changed my query to use a temporary table (#Table):

if object_id ('tempdb..#Keys') is not null
    drop table #Keys
create table #Keys (KeyValue decimal(15,0))
insert into #Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from #Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

select *
from #Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

These queries were optimized and ran exactly as I expected: get the NULL value from #Keys first, then seek into Order_Details_Taxes. They are the last two queries in the linked query plan.

Why do the queries that use a @Table variable perform index and table scans on this large table, when I am joining from a table that holds a single NULL value to a table whose key column contains only NULLs?

I assume the answer is the statistics and cardinality-estimation limitations of @Table variables, but the resulting query plan was not intuitive to me.

ANSI_NULLS is ON for this table and for my session.

Best Answer

The behavior that you're seeing is caused by the lack of statistics on the table variable. When I want to learn more about why the query optimizer chose a particular plan, I sometimes add hints and compare the queries side by side. That approach is helpful here.

First I'll create a table that is close enough in structure to yours to reproduce the same behavior:

CREATE TABLE dbo.Order_Details_Taxes (
    OrdTax_PLTax_LoadDtl_Key decimal(15,0),
    FILLER VARCHAR(30)
);

-- Load a few million rows, every one with a NULL key, by cross joining
-- master..spt_values (roughly 2,500 rows) with itself.
INSERT INTO dbo.Order_Details_Taxes WITH (TABLOCK)
SELECT NULL, REPLICATE('Z', 30)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

CREATE INDEX [IX_OrdTax_PLTax_LoadDtl_Key] ON Order_Details_Taxes (OrdTax_PLTax_LoadDtl_Key);
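As a quick check on the setup (this verification query is my addition, not part of the original repro), COUNT(column) ignores NULLs, so counting the key column should return zero:

-- Sanity check: every key value loaded above is NULL.
SELECT COUNT(*) AS total_rows,
       COUNT(OrdTax_PLTax_LoadDtl_Key) AS non_null_keys -- should be 0
FROM dbo.Order_Details_Taxes;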

To see how the query optimizer costs the different join types, I can get an estimated plan for the following three queries:

declare @Keys table (KeyValue decimal(15,0))
insert into @Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue;

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (LOOP JOIN, MAXDOP 1);

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (HASH JOIN, MAXDOP 1);

Here is a screenshot of the estimated plans:

[screenshot: table variable plans]

SQL Server doesn't know anything about the value of the row in the table variable, so it costs the nested loop plan using the density from the statistics on OrdTax_PLTax_LoadDtl_Key. All of the rows have the same value in the statistics, so the density is 1. One of the general assumptions in the query optimizer's model is that data exists if the end user is looking for it. Your index seek is therefore expected to return the same number of rows as the scan, and it carries the same cost, despite the fact that the histogram contains only NULLs. In this case the optimizer doesn't go back and apply special knowledge about NULLs to change the plan. You could argue that the optimizer should be improved to handle this, but it does seem like an uncommon scenario.
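You can confirm that density yourself (a quick check of my own; the statistics here belong to the IX_OrdTax_PLTax_LoadDtl_Key index created above). A density of 1 multiplied by the table cardinality means the estimate coming out of the seek is the entire table:

-- Density is 1 / (number of distinct values). Every row carries the
-- same (NULL) value, so the 'All density' column reports 1.
DBCC SHOW_STATISTICS ('dbo.Order_Details_Taxes', 'IX_OrdTax_PLTax_LoadDtl_Key')
    WITH DENSITY_VECTOR;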

The difference in the costs of the plans ultimately comes down to the costs of the join operators themselves. For whatever reason, the query optimizer costs the loop join higher than the merge join. The hash join is costed high as well, but for that one SQL Server expects to compute millions of hashes, so the higher cost is more understandable, in my opinion.
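To confirm that the unhinted plan really is the cheapest as a merge join, you could pin a fourth variant to it (my addition to the original three; run it in the same batch as the @Keys declaration) and compare the estimated subtree costs:

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (MERGE JOIN, MAXDOP 1);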

Can you get the same plan with a temp table that doesn't have statistics? The right way to test this would be to disable automatic statistics creation for the table, but I'll take a shortcut:

if object_id ('tempdb..#Keys') is not null
    drop table #Keys
create table #Keys (KeyValue decimal(15,0))
-- Create an empty statistics object before any rows are loaded.
-- NORECOMPUTE stops it from being updated, and its existence prevents
-- auto-created statistics from building a useful histogram on KeyValue.
CREATE STATISTICS s1 on #Keys (KeyValue) WITH NORECOMPUTE;
insert into #Keys (KeyValue) values (null)
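To verify the shortcut worked (a quick check of my own, not in the original answer), look at the statistics header for s1. It was built over an empty table and NORECOMPUTE keeps it frozen, so it still reports 0 rows after the insert:

-- Rows and Rows Sampled stay at 0, so the optimizer gets no histogram
-- information for KeyValue.
DBCC SHOW_STATISTICS ('tempdb..#Keys', 's1') WITH STAT_HEADER;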

Everything looks the same as the table variable plan:

[screenshot: temp table without statistics]

That's why I said the behavior is caused by the lack of statistics. When you use a temp table and allow automatic statistics creation, the optimizer has a histogram on the temp table's column. It can use that information to generate more accurate cardinality estimates for the nested loop join plan and the index seek:

[screenshot: temp table with statistics]

The histogram suggests that no rows will be matched, so you end up with the minimum cardinality estimate of 1 row coming out of the seek. The costs of the loop join and the seek are reduced accordingly, and the nested loop join plan has by far the lowest cost of the three join types.
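You can look at that histogram directly (a sketch of my own, assuming #Keys was recreated without the s1 shortcut and the join has been run so the auto-created statistics exist; their name is system-generated, so it has to be looked up first):

-- Find the auto-created statistics object on the temp table. Its name
-- is generated by SQL Server and looks like _WA_Sys_... .
SELECT name
FROM tempdb.sys.stats
WHERE object_id = OBJECT_ID('tempdb..#Keys');

-- Plug the name from the query above into SHOW_STATISTICS. The histogram
-- should contain a single step with RANGE_HI_KEY = NULL and EQ_ROWS = 1.
-- DBCC SHOW_STATISTICS ('tempdb..#Keys', '<name from query above>') WITH HISTOGRAM;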

Having some NULL values in the outer table of a join is a significantly more common scenario than joining to a table in which the key is all NULLs. In other words, I would expect better model support for comparing two histograms that both contain NULLs than for comparing a histogram of nothing but NULLs against an unknown value. With better model support you get better cardinality estimates, and in this case the better cardinality estimates result in a significantly more efficient query plan.