Sql-server – Hash keys probe and residual

database-internals, execution-plan, performance, query-performance, sql-server

Say we have a query like this:

select a.*,b.*
from 
a join b
on a.col1=b.col1
and len(a.col1)=10

Assuming the above query uses a Hash Join and has a residual, the probe key will be col1 and the residual will be len(a.col1)=10.

But while going through another example, I saw the same column used as both the probe key and the residual. The details are below:

Query:

select *
from T1 join T2 on T1.a = T2.a

Execution plan, with probe and residual highlighted:

Test data:

create table T1 (a int, b int, x char(200))
create table T2 (a int, b int, x char(200))

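set nocount on

-- T1: 1,000 rows; column a holds multiples of 2
declare @i int
set @i = 0
while @i < 1000
  begin
    insert T1 values (@i * 2, @i * 5, @i)
    set @i = @i + 1
  end

-- T2: 10,000 rows; column a holds multiples of 3
-- (@i is reused: declaring it a second time in the same batch is an error)
set @i = 0
while @i < 10000
  begin
    insert T2 values (@i * 3, @i * 7, @i)
    set @i = @i + 1
  end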

Question:

How can a probe and a residual be the same column? Why can't SQL Server use the probe column only? Why does it have to use the same column as a residual to filter rows again?

References for test data:

Hash Join (Craig Freedman's SQL Server Blog)

Best Answer

If the join is on a single column typed as tinyint, smallint, or integer* and if both columns are constrained to be NOT NULL, the hash function is 'perfect' – meaning there is no chance of a hash collision, and the query processor does not have to check the values again to ensure they really match.

Otherwise, you will see a residual as items in the hash bucket are tested for a match, not just a hash function match.

Your test does not specify NULL or NOT NULL for the columns (a bad practice, by the way), so it appears you are using a database where NULL is the default.

More information is available in my post "Join Performance, Implicit Conversions, and Residuals" and in "Hash Join Execution Internals" by Dmitry Pilugin.

* Other qualifying types are bit, smalldatetime, smallmoney, and (var)char(n) for n = 1 with a binary collation.
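One way to check this against the test tables is to repeat the join with the column declared NOT NULL and inspect the Hash Match operator again. The following is only a sketch: T1n and T2n are hypothetical table names, and the join hint is there solely to force a hash join on empty tables.

```sql
-- Hypothetical copies of the test tables with the join column declared NOT NULL.
create table T1n (a int not null, b int, x char(200))
create table T2n (a int not null, b int, x char(200))
go

-- Return the estimated plan as XML instead of executing the query.
set showplan_xml on
go

-- The hint forces a hash join so the operator can be inspected even though
-- the tables are empty. Per the answer above, the Hash Match operator is
-- expected to list its hash keys without a probe residual.
select *
from T1n join T2n on T1n.a = T2n.a
option (hash join)
go

set showplan_xml off
go
```

Running the same statements against the original T1 and T2, whose columns allow NULL, should show the residual on T1.a = T2.a for comparison.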

Related Solutions

Why would SQL cast have to be expensive if used on a join column

CAST or any other function is expensive when used in a JOIN condition because it makes the predicate on the transformed column(s) non-SARGable.

The database has no way to know what the output of the function will be until it runs the function. This is as true for CAST as it is for a UDF that you write which does some fancy string manipulation logic.

The condition being non-SARGable means that the database can't use a search argument to find the value - it has to run it on every single row that matches the other criteria in the query.

For this reason, functions applied to columns should be avoided in filtering logic (JOINs, WHERE clauses, etc.) wherever possible, as they typically force a table or index scan.
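As a rough illustration of the difference, here is a sketch that uses a WHERE clause rather than a join, since the same reasoning applies. The table dbo.Orders, its columns, and the index name are all hypothetical.

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE dbo.Orders
(
    order_id   int      NOT NULL PRIMARY KEY,
    order_date datetime NOT NULL
);
CREATE INDEX IX_Orders_order_date ON dbo.Orders (order_date);
GO

-- Non-SARGable: YEAR() wraps the column, so the function has to be
-- evaluated for each row before the comparison can be made.
SELECT order_id
FROM dbo.Orders
WHERE YEAR(order_date) = 2020;

-- SARGable rewrite: the column is left bare and compared against a range,
-- so an index seek on IX_Orders_order_date becomes possible.
SELECT order_id
FROM dbo.Orders
WHERE order_date >= '20200101'
  AND order_date <  '20210101';
```

Both queries return the same rows. Only the second gives the optimizer a search argument it can match against the index.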

Sql-server – Using NOCOUNT to Improve Procedure Performance

You only need to SET NOCOUNT ON; once per procedure, preferably at the top of the body of the procedure itself. Certainly you'd need it prior to any statement that generates output.

So, for instance, I'd use something like this as a template for creating procedures:

CREATE PROCEDURE dbo.MyProc
AS
BEGIN
    SET NOCOUNT ON;
    ....
END
GO

Books Online says this about SET NOCOUNT ON:

Stops the message that shows the count of the number of rows affected by a Transact-SQL statement or stored procedure from being returned as part of the result set.

SET NOCOUNT ON prevents the sending of DONE_IN_PROC messages to the client for each statement in a stored procedure. For stored procedures that contain several statements that do not return much actual data, or for procedures that contain Transact-SQL loops, setting SET NOCOUNT to ON can provide a significant performance boost, because network traffic is greatly reduced.

Setting this option at the start of the procedure body, as I've outlined above, makes it easy to verify the statement is actually in the procedure.
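To make the loop case concrete, here is a minimal sketch of a procedure that runs many small statements. The table dbo.LoopDemo and the procedure dbo.FillLoopDemo are hypothetical names used only for illustration.

```sql
-- Hypothetical table and procedure, for illustration only.
CREATE TABLE dbo.LoopDemo (n int NOT NULL);
GO

CREATE PROCEDURE dbo.FillLoopDemo
    @rows int
AS
BEGIN
    SET NOCOUNT ON;  -- suppress the per-statement "rows affected" messages

    DECLARE @i int = 0;
    WHILE @i < @rows
    BEGIN
        -- Without NOCOUNT, each INSERT would send its own row-count message.
        INSERT dbo.LoopDemo (n) VALUES (@i);
        SET @i = @i + 1;
    END
END
GO

EXEC dbo.FillLoopDemo @rows = 1000;
```

@@ROWCOUNT is still updated when NOCOUNT is ON, so code inside the procedure can keep checking how many rows a statement affected.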

Be aware that the row count feature is used by certain software, not least SQL Server itself for Linked Servers, to ascertain whether or not the executed DML was successful. Setting NOCOUNT ON may cause unexpected errors that can be difficult to troubleshoot. Also note the following comment and advice from @AaronBertrand:

One thing to keep in mind (and a disclaimer I give whenever I recommend NOCOUNT), is that it can interfere with certain technologies. For example if you have old ADO code (prior to ASP.NET), it interprets the DONE_IN_PROC messages as independent resultsets, so your existing code may already have things like rs.nextRecordSet() to skip them. Also certain modules in Entity Framework (and probably other ORMs) rely on those messages to determine success of DML operations. So don't just blindly add them to all of your code if you use these technologies and already have working code.
