Sql-server – Does the hash join operator always pull from the build side

sql serversql server 2014

Watching queries execute with Live Query Statistics I noticed that it seems SQL Server is lazily constructing a hash table from the build input of a hash join.

This is a meaningful difference in the case of 0 probe rows. It potentially saves the entire build side tree.

I always thought a hash ran like this:

Create build-side hash table.
Match all probe rows against it.

But is it the following instead?

Pull first probe row.
Complete the operation if no row available (short-circuit).
Match all probe rows against it.

I'm not sure how to conclusively test which way it is. I'm not sure the Live Query Statistics output can be trusted in this way. Does anyone know how this works?

Best Answer

If you post the actual query plan for your specific query, we might be able to comment on that more directly.

But in general, I believe (and have always observed) that the hash join operator does always runs as you expected:

Create build-side hash table.
Match all probe rows against it.

This generally makes sense given that SQL Server will place the smaller (estimated) rowset on the build side of the join. The probe side is expected to be larger than the build side (and therefore to contain at least some rows).

In addition, SQL Server has a potential bitmap optimization that it can apply to the probe side of a hash join in some cases, but this optimization requires that the build side be processed first.

An example

For a simple query where we force a large build side of the hash join and an empty probe side, the actual execution plan shows that all rows are processed on the build side of the join.

-- Fully processes the build side of the hash join
SELECT SUM(b.number*p.number) AS meaningless
FROM (
    SELECT 1.0 * v1.number * v2.number AS number
    FROM master..spt_values v1
    CROSS JOIN master..spt_values v2
) b
INNER HASH JOIN (
    SELECT v3.number
    FROM master..spt_values v3
    WHERE number < -100000
) p
    ON p.number = b.number

Can the build side ever be skipped?

There is at least one example where the build side of the hash join can be skipped: In some cases, the entire join can be eliminated. For example, here SQL Server is able to prove prior to execution that the probe side will have 0 rows. The final query plan is therefore a constant scan and there is no hash join.

-- No hash join at all; the query optimizer realizes no rows are guaranteed
SELECT SUM(b.number*p.number) AS meaningless
FROM (
    SELECT 1.0 * v1.number * v2.number AS number
    FROM master..spt_values v1
    CROSS JOIN master..spt_values v2
) b
INNER HASH JOIN (
    SELECT v3.number
    FROM master..spt_values v3
    WHERE 0=1
) p
    ON p.number = b.number

What about batch mode?

Based on a quick test, it seems that the batch mode hash join operator will also fully process the build side of the hash join.

-- The batch mode hash join also fully processes the build side
-- even if there are 0 rows on the probe side
SELECT number
INTO #spt_values
FROM master..spt_values
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_#spt_values ON #spt_values
GO
SELECT SUM(b.number*p.number) AS meaningless
FROM (
    SELECT 1.0 * v1.number * v2.number AS number
    FROM #spt_values v1
    CROSS JOIN #spt_values v2
) b
INNER HASH JOIN (
    SELECT v3.number
    FROM #spt_values v3
    WHERE number < -100000
) p
    ON p.number = b.number
GO

Related Solutions

Sql-server – How to propagate a grandparent relationship throughout a tree

@Tim you have come to the right place.

It just so happens that I answered a question like this ( Find highest level of a hierarchical field: with vs without CTEs ) back in Oct 24, 2011.

I wrote Stored Procedures that traverse a table with id and parent fields. I also included sample data.

However, you may not have to resort to the Stored Procedures.

Here is what I mean: Let's start with the sample data

USE junk 
DROP TABLE IF EXISTS pctable; 
CREATE TABLE pctable 
( 
    id INT NOT NULL AUTO_INCREMENT, 
    parent_id INT, 
    PRIMARY KEY (id) 
) ENGINE=MyISAM; 
INSERT INTO pctable (parent_id) VALUES (0); 
INSERT INTO pctable (parent_id) SELECT parent_id+1 FROM pctable; 
INSERT INTO pctable (parent_id) SELECT parent_id+2 FROM pctable; 
INSERT INTO pctable (parent_id) SELECT parent_id+3 FROM pctable; 
INSERT INTO pctable (parent_id) SELECT parent_id+4 FROM pctable; 
INSERT INTO pctable (parent_id) SELECT parent_id+5 FROM pctable; 
SELECT * FROM pctable;

To reach any grandparent, you could just perform a JOIN of the pctable against itself:

SELECT P.id,GP.parent_id FROM pctable P
INNER JOIN pctable GP ON P.parent_id = GP.id;

To reach any great-grandparent, you could just perform a JOIN of the pctable against itself twice:

SELECT P.id,GGP.parent_id FROM pctable P
INNER JOIN pctable GP ON P.parent_id = GP.id;
INNER JOIN pctable GGP ON GP.parent_id = GGP.id;

Since your depth is no more than three, doing these JOINs should be quite enough.

Sql-server – How to row estimates be improved in order to reduce chances of spills to tempdb

I won't comment about spills, tempdb or hints because the query seems pretty simple to need that much consideration. I think SQL-Server's optimizer will do its job quite good, if there are indexes suited for the query.

And your splitting into two queries is good as it shows what indexes will be useful. The first part:

(select convert(bigint, Value) NodeId
 from Oav.ValueArray
 where PropertyId = 3331  
   and ObjectId = 3540233
   and Sequence = 2)

needs an index on (PropertyId, ObjectId, Sequence) including the Value. I'd make it UNIQUE to be safe. The query would throw error anyway during runtime if more than one rows were returned, so it's good to ensure in advance that this won't happen, with the unique index:

CREATE UNIQUE INDEX
    PropertyId_ObjectId_Sequence_UQ
  ON Oav.ValueArray
    (PropertyId, ObjectId, Sequence) INCLUDE (Value) ;

The second part of the query:

select Value
  from Oav.ValueArray
 where ObjectId = @a               
   and PropertyId = 2840

needs an index on (PropertyId, ObjectId) including the Value:

CREATE INDEX
    PropertyId_ObjectId_IX
  ON Oav.ValueArray
    (PropertyId, ObjectId) INCLUDE (Value) ;

If efficiency is not improved or these indexes were not used or there are still differences in row estimates appearing, then there would be need to look further into this query.

In that case, the conversions (needed from the EAV design and the storing of different datatypes in the same columns) are a probable cause and your solution of splitting (as @AAron Bertrand and @Paul White comment) the query into two parts seems natural and the way to go. A redesign so to have different datatypes in their respective columns might be another.

Best Answer

Related Solutions

Sql-server – How to propagate a grandparent relationship throughout a tree

Sql-server – How to row estimates be improved in order to reduce chances of spills to tempdb

Related Question