SQL Server – Join on NULL Key Column Optimizing as Table and Index Scans

Tags: execution-plan, join, null, sql-server, sql-server-2014

I have a question about this query plan.

We have a table in a test environment, Order_Details_Taxes, that has 11,225,799 rows. This table has a column, OrdTax_PLTax_LoadDtl_Key, which is NULL on every single row. This test environment is configured in such a way that this column will always be NULL. There is an index on this column.

I ran some queries that join to this table on a NULL key value. An inner join on NULL can never return any rows, because NULL never compares equal to anything, not even another NULL.
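As a quick sanity check (this snippet is just my illustration, not part of the actual workload), a NULL-to-NULL equality comparison evaluates to UNKNOWN rather than TRUE when ANSI_NULLS is on:

-- With ANSI_NULLS ON, null = null evaluates to UNKNOWN, not TRUE,
-- so the CASE expression falls through to the ELSE branch.
select case when null = null then 'match' else 'no match' end -- returns 'no match'

Here are the two queries I ran: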

declare @Keys table (KeyValue decimal(15,0))
insert into @Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

select *
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

These are the first two queries in the linked query plan. The first select scans the 11-million-row table and joins the result to @Keys. The second select starts from @Keys, but it still does a clustered index scan on this large table.

I know table variables (@Tables) are questionable in many cases, so I changed my query to use a temporary table (#Table):

if object_id ('tempdb..#Keys') is not null
    drop table #Keys
create table #Keys (KeyValue decimal(15,0))
insert into #Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from #Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

select *
from #Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue

These queries were optimized and ran exactly as I expected: get the NULL value from #Keys first, then seek into Order_Details_Taxes. They are the last two queries in the linked query plan.

Why do the queries that use a @Table variable perform index and table scans on this large table, when I am joining from a table that holds a single NULL value to a table whose key column contains only NULLs?

I assume the answer is the statistics and cardinality-estimation limitations of @Table variables, but the resulting query plan was not intuitive to me.

ANSI_NULLS is ON for this table and for my session.

Best Answer

The behavior that you're seeing is caused by the lack of statistics on the table variable. When I want to learn more about why the query optimizer chose a particular plan, I sometimes add hints and compare the queries side by side. That approach is helpful here.

First I'll create a table that is close enough in structure to yours to reproduce the same behavior:

CREATE TABLE dbo.Order_Details_Taxes (
    OrdTax_PLTax_LoadDtl_Key decimal(15,0),
    FILLER VARCHAR(30)
);

-- Load a few million rows, every one with a NULL key, by cross joining
-- master..spt_values (roughly 2,500 rows) with itself.
INSERT INTO dbo.Order_Details_Taxes WITH (TABLOCK)
SELECT NULL, REPLICATE('Z', 30)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

CREATE INDEX [IX_OrdTax_PLTax_LoadDtl_Key] ON Order_Details_Taxes (OrdTax_PLTax_LoadDtl_Key);
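As a quick check on the setup (this verification query is my addition, not part of the original repro), COUNT(column) ignores NULLs, so counting the key column should return zero:

-- Sanity check: every key value loaded above is NULL.
SELECT COUNT(*) AS total_rows,
       COUNT(OrdTax_PLTax_LoadDtl_Key) AS non_null_keys -- should be 0
FROM dbo.Order_Details_Taxes;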

To see how the query optimizer costs the different join types, I can get an estimated plan for the following three queries:

declare @Keys table (KeyValue decimal(15,0))
insert into @Keys (KeyValue) values (null)

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue;

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (LOOP JOIN, MAXDOP 1);

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (HASH JOIN, MAXDOP 1);

Here is a screenshot of the estimated plans:

[screenshot: table variable plans]

SQL Server doesn't know anything about the value of the row in the table variable, so it costs the nested loop plan using the density from the statistics on OrdTax_PLTax_LoadDtl_Key. All of the rows have the same value in the statistics, so the density is 1. One of the general assumptions in the query optimizer's model is that data exists if the end user is looking for it. Your index seek is therefore expected to return the same number of rows as the scan, and it carries the same cost, despite the fact that the histogram contains only NULLs. In this case the optimizer doesn't go back and apply special knowledge about NULLs to change the plan. You could argue that the optimizer should be improved to handle this, but it does seem like an uncommon scenario.
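You can confirm that density yourself (a quick check of my own; the statistics here belong to the IX_OrdTax_PLTax_LoadDtl_Key index created above). A density of 1 multiplied by the table cardinality means the estimate coming out of the seek is the entire table:

-- Density is 1 / (number of distinct values). Every row carries the
-- same (NULL) value, so the 'All density' column reports 1.
DBCC SHOW_STATISTICS ('dbo.Order_Details_Taxes', 'IX_OrdTax_PLTax_LoadDtl_Key')
    WITH DENSITY_VECTOR;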

The difference in the costs of the plans ultimately comes down to the costs of the join operators themselves. For whatever reason, the query optimizer costs the loop join higher than the merge join. The hash join is costed high as well, but for that one SQL Server expects to compute millions of hashes, so the higher cost is more understandable, in my opinion.
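To confirm that the unhinted plan really is the cheapest as a merge join, you could pin a fourth variant to it (my addition to the original three; run it in the same batch as the @Keys declaration) and compare the estimated subtree costs:

select OrdTax_PLTax_LoadDtl_Key
from @Keys
inner join Order_Details_Taxes
    on OrdTax_PLTax_LoadDtl_Key = KeyValue
OPTION (MERGE JOIN, MAXDOP 1);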

Can you get the same plan with a temp table that doesn't have statistics? The right way to test this would be to disable automatic statistics creation for the table, but I'll take a shortcut:

if object_id ('tempdb..#Keys') is not null
    drop table #Keys
create table #Keys (KeyValue decimal(15,0))
-- Create an empty statistics object before any rows are loaded.
-- NORECOMPUTE stops it from being updated, and its existence prevents
-- auto-created statistics from building a useful histogram on KeyValue.
CREATE STATISTICS s1 on #Keys (KeyValue) WITH NORECOMPUTE;
insert into #Keys (KeyValue) values (null)
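To verify the shortcut worked (a quick check of my own, not in the original answer), look at the statistics header for s1. It was built over an empty table and NORECOMPUTE keeps it frozen, so it still reports 0 rows after the insert:

-- Rows and Rows Sampled stay at 0, so the optimizer gets no histogram
-- information for KeyValue.
DBCC SHOW_STATISTICS ('tempdb..#Keys', 's1') WITH STAT_HEADER;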

Everything looks the same as the table variable plan:

[screenshot: temp table without statistics]

That's why I said the behavior is caused by the lack of statistics. When you use a temp table and allow automatic statistics creation, the optimizer has a histogram on the temp table's column. It can use that information to generate more accurate cardinality estimates for the nested loop join plan and the index seek:

[screenshot: temp table with statistics]

The histogram suggests that no rows will be matched, so you end up with the minimum cardinality estimate of 1 row coming out of the seek. The costs of the loop join and the seek are reduced accordingly, and the nested loop join plan has by far the lowest cost of the three join types.
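You can look at that histogram directly (a sketch of my own, assuming #Keys was recreated without the s1 shortcut and the join has been run so the auto-created statistics exist; their name is system-generated, so it has to be looked up first):

-- Find the auto-created statistics object on the temp table. Its name
-- is generated by SQL Server and looks like _WA_Sys_... .
SELECT name
FROM tempdb.sys.stats
WHERE object_id = OBJECT_ID('tempdb..#Keys');

-- Plug the name from the query above into SHOW_STATISTICS. The histogram
-- should contain a single step with RANGE_HI_KEY = NULL and EQ_ROWS = 1.
-- DBCC SHOW_STATISTICS ('tempdb..#Keys', '<name from query above>') WITH HISTOGRAM;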

Having some NULL values in the outer table of a join is a significantly more common scenario than joining to a table in which the key is all NULLs. In other words, I would expect better model support for comparing two histograms that both contain NULLs than for comparing a histogram of nothing but NULLs against an unknown value. With better model support you get better cardinality estimates, and in this case the better cardinality estimates result in a significantly more efficient query plan.