Sql-server – Why is a hash match operator in this very basic query

indexperformancesql server

I'm beginning to learn some about looking at execution plans and making queries more efficient

Consider these two basic queries

select distinct pat_id, drug_class, drug_name from rx 

select pat_id, drug_class, drug_name from rx 

and their execution plans

enter image description here

index being used:

CREATE CLUSTERED INDEX [ix_overlap] ON [dbo].[rx] 
(
    [pat_id] ASC,
    [fill_date] ASC,
    [script_end_date] ASC,
    [drug_name] ASC
)

Even though the first query supposedly has the higher cost by a 4:1 margin it runs faster than the second one. Why is it that a simple distinct added to the query will add the (what I assume to always be bad, corrections are welcome) hash match operator? And why does it have the higher query cost relative to the second query if it runs faster.

Best Answer

The first query is using a parallel plan, meaning the "work" was split into multiple tasks carried out by multiple threads. The cumulative CPU time was therefore higher than for the serial plan used for your second query.

As to why the distinct causes the hash match operator to appear in the plan; an aggregate or sort operation is required to determine the DISTINCT result. @SQL_Kiwi might pop up with a more indepth explanation shortly but the hash match operator is apparently favoured for larger result sets.

The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s). Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism. Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group). Hash matching does not require or preserve the order of the incoming row stream. [Source]