While trying to apply the contents of this question below to my own situation, I am a bit confused as how I could get rid of the operator Hash Match (Inner Join) if any way possible.
SQL Server query performance – removing need for Hash Match (Inner Join)
I noticed the cost of 10% and was wondering if I could reduce it.
See the query plan below.
This work comes from a query thad I had to tune today:
SELECT c.AccountCode, MIN(d.CustomerSID)
FROM Stage.Customer c
INNER JOIN Dimensions.Customer d ON c.Email = d.Email
OR (
c.HomePostCode = d.HomePostCode
AND c.StrSurname = d.strSurname
)
GROUP BY c.AccountCode
and after adding these indexes:
---------------------------------------------------------------------
-- Create the indexes
---------------------------------------------------------------------
CREATE NONCLUSTERED INDEX IDX_Stage_Customer_HOME_SURNAME_INCL
ON Stage.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go
CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_HOME_SURNAME_INCL
ON Dimensions.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go
CREATE NONCLUSTERED INDEX IDX_Stage_Customer_EMAIL_INCL
ON Stage.Customer(EMAIL)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go
CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_EMAIL_INCL
ON Dimensions.Customer(EMAIL)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go
this is the new query:
----------------------------------------------------------------------------
-- new query
----------------------------------------------------------------------------
SELECT *
FROM (
SELECT AccountCode
,RO=ROW_NUMBER () OVER (PARTITION BY AccountCode ORDER BY CustomerSID)
--,CustomerSID=MIN(CustomerSID) OVER (PARTITION BY AccountCode ORDER BY AccountCode)
,CustomerSID
FROM (
SELECT c.AccountCode, D.CustomerSID
FROM Stage.Customer c
INNER JOIN Dimensions.Customer d ON c.Email = d.Email
UNION ALL
SELECT c.AccountCode, D.CustomerSID
FROM Stage.Customer c
INNER JOIN Dimensions.Customer d ON c.HomePostCode = d.HomePostCode
AND c.StrSurname = d.strSurname
) RADHE
) R1
WHERE RO = 1
This has reduced the query execution time from 8 minutes to 1 second.
Everybody is happy, but still I would like to know if I could get more done,
I.e. by somehow removing the hash match operator.
Why is it there at the first place, I am matching all the fields, why hash?
Best Answer
the following links will provide a good source of knowledge regarding execution plans.
From Execution Plan Basics — Hash Match Confusion I found:
Can you explain this execution plan? provides good insights about the execution plan with, not specific to hash match but relevant.
In this question: Can I get SSMS to show me the Actual query costs in the Execution plan pane? I'm fixing performance issues on a multistatement stored procedure in SQL Server. I want to know which part(s) I should spend time on.
I understand from How do I read Query Cost, and is it always a percentage? that even when SSMS is told to Include Actual Execution Plan, the "Query cost (relative to the batch)" figures is still based on cost estimates, which can be far off actuals
Measuring Query Performance : “Execution Plan Query Cost” vs “Time Taken” gives good info for when you need to compare the performance of 2 different queries.
In Reading a SQL Server Execution plan you can find great tips for reading the execution plan.
Other questions/answers that I really liked because they are relevant to this subject, and for my personal reference I would like to quote are:
How to optimise T-SQL query using Execution Plan
can sql generate a good plan for this procedure?
Execution Plans Differ for the Same SQL Statement