Improving Query Performance by Removing Hash Match Inner Join in SQL Server 2014

execution-planindexperformancequery-performancesql serversql server 2014

While trying to apply the contents of this question below to my own situation, I am a bit confused as how I could get rid of the operator Hash Match (Inner Join) if any way possible.

SQL Server query performance – removing need for Hash Match (Inner Join)

I noticed the cost of 10% and was wondering if I could reduce it.
See the query plan below.

enter image description here

This work comes from a query thad I had to tune today:

SELECT c.AccountCode, MIN(d.CustomerSID) 
FROM   Stage.Customer c 
INNER JOIN Dimensions.Customer d  ON c.Email = d.Email
                                  OR (
                                          c.HomePostCode = d.HomePostCode
                                       AND c.StrSurname = d.strSurname
                                                                    )
GROUP BY c.AccountCode

and after adding these indexes:

---------------------------------------------------------------------
-- Create the indexes
---------------------------------------------------------------------

CREATE NONCLUSTERED INDEX IDX_Stage_Customer_HOME_SURNAME_INCL
ON Stage.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go


CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_HOME_SURNAME_INCL
ON Dimensions.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go



CREATE NONCLUSTERED INDEX IDX_Stage_Customer_EMAIL_INCL
ON Stage.Customer(EMAIL)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go


CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_EMAIL_INCL
ON Dimensions.Customer(EMAIL)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go

this is the new query:

----------------------------------------------------------------------------
-- new query 
----------------------------------------------------------------------------

SELECT * 
FROM (    
SELECT AccountCode
     ,RO=ROW_NUMBER () OVER (PARTITION BY AccountCode ORDER BY CustomerSID)
     --,CustomerSID=MIN(CustomerSID) OVER (PARTITION BY AccountCode ORDER BY AccountCode)
       ,CustomerSID
FROM (    
          SELECT c.AccountCode, D.CustomerSID
       FROM   Stage.Customer c 
       INNER JOIN Dimensions.Customer d  ON c.Email = d.Email

          UNION ALL

          SELECT c.AccountCode, D.CustomerSID
       FROM   Stage.Customer c 
       INNER JOIN Dimensions.Customer d  ON c.HomePostCode = d.HomePostCode
                                        AND c.StrSurname = d.strSurname
) RADHE
) R1
WHERE RO = 1

This has reduced the query execution time from 8 minutes to 1 second.

Everybody is happy, but still I would like to know if I could get more done,
I.e. by somehow removing the hash match operator.

Why is it there at the first place, I am matching all the fields, why hash?

Best Answer

the following links will provide a good source of knowledge regarding execution plans.

From Execution Plan Basics — Hash Match Confusion I found:

From http://sqlinthewild.co.za/index.php/2007/12/30/execution-plan-operations-joins/

"The hash join is one of the more expensive join operations, as it requires the creation of a hash table to do the join. That said, it’s the join that’s best for large, unsorted inputs. It is the most memory-intensive of any of the joins

The hash join first reads one of the inputs and hashes the join column and puts the resulting hash and the column values into a hash table built up in memory. Then it reads all the rows in the second input, hashes those and checks the rows in the resulting hash bucket for the joining rows."

which links to this post:

http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx

Can you explain this execution plan? provides good insights about the execution plan with, not specific to hash match but relevant.

The constant scans are a way for SQL Server to create a bucket into which it's going to place something later in the execution plan. I've posted a more thorough explanation of it here. To understand what the constant scan is for, you have to look further into the plan. In this case, it's the Compute Scalar operators that are being used to populate the space created by the constant scan.

The Compute Scalar operators are being loaded up with NULL and the value 1045876, so they're clearly going to be used with the Loop Join in an effort to filter the data.

The really cool part is that this plan is Trivial. It means that it went through a minimal optimization process. All the operations are leading up to the Merge Interval. This is used to create a minimal set of comparison operators for an index seek (details on that here).

In this question: Can I get SSMS to show me the Actual query costs in the Execution plan pane? I'm fixing performance issues on a multistatement stored procedure in SQL Server. I want to know which part(s) I should spend time on.

I understand from How do I read Query Cost, and is it always a percentage? that even when SSMS is told to Include Actual Execution Plan, the "Query cost (relative to the batch)" figures is still based on cost estimates, which can be far off actuals

Measuring Query Performance : “Execution Plan Query Cost” vs “Time Taken” gives good info for when you need to compare the performance of 2 different queries.

In Reading a SQL Server Execution plan you can find great tips for reading the execution plan.

Other questions/answers that I really liked because they are relevant to this subject, and for my personal reference I would like to quote are:

How to optimise T-SQL query using Execution Plan

can sql generate a good plan for this procedure?

Execution Plans Differ for the Same SQL Statement

Related Solutions

Sql-server – Query tuning – performance

You should look at the Execution Plan and see where most of the cost is for the query. Without the results of the Execution Plan it is difficult to offer advice.

I suspect the query optimizer is using the best process possible for the query as it is structured. You might try breaking the query into smaller parts, like query the email field to get the results into a temp table first, then run your main query using the temp table of emails.

The might not make the query run faster, but this might expose information in the Execution Plan that will help you troubleshoot.

Another option is to use include to add columns to your non-clustered index to return all the values in the index scan to maybe speed up the process.

Sql-server – Hash Match inner join in simple query with in statement

For the 1st query, an index that uses all three columns from the WHERE clause and includes the column from the SELECT list would be much more useful:

-- index suggestion A
(clientid, empuid, newvalue) INCLUDE (m_uid)

or an index targeted specifically for this query:

-- index suggestion B
(clientid, empuid, m_uid)
WHERE newvalue in ('Deleted', 'DB-Deleted', 'Added')

Regarding the trigger, some comments:

The first query you show does not appear in the trigger. What appears is a join from that table to the inserted rows to another table (which has the trigger).
My suggestion B above seems better suited to be used by the trigger.

The trigger has 2 almost identical insert statements. Why? I think they could be combined in one - and simpler - insert and using NOT EXISTS instead of NOT IN:

insert EmpTaxAudit 
  ( Clientid, empuid, m_uid, m_eff_start_date, ColumnName, 
    ReplacedOn, ReplacedBy, OldValue, dblogin, 
    newvalue
  )
select 
    Clientid, empuid, m_uid, m_eff_start_date, 'taxcode',
    getdate(), IsNull(userid,@user), '', Left(@user,15),
    case when m_eff_end_date is null 
        then 'Added' else 'Deleted'
    end
from inserted i
where not exists 
      ( select 1 
        from EmpTaxAudit 
        where m_uid = i.m_uid 
          and clientid = i.clientid  
          and empuid = i.empuid
          and newvalue in ('Deleted', 'DB-Deleted', 'Added')
      ) ;

Best Answer

Related Solutions

Sql-server – Query tuning – performance

Sql-server – Hash Match inner join in simple query with in statement

Related Question