Sql-server – Why is a hash match operator in this very basic query

index, performance, sql-server

I'm beginning to learn about reading execution plans and making queries more efficient.

Consider these two basic queries:

select distinct pat_id, drug_class, drug_name from rx 

select pat_id, drug_class, drug_name from rx

and their execution plans:

[image: execution plans for the two queries]

The index being used:

CREATE CLUSTERED INDEX [ix_overlap] ON [dbo].[rx] 
(
    [pat_id] ASC,
    [fill_date] ASC,
    [script_end_date] ASC,
    [drug_name] ASC
)

Even though the first query supposedly has the higher cost, by a 4:1 margin, it runs faster than the second one. Why does adding a simple DISTINCT to the query introduce the hash match operator (which I assume is always bad; corrections are welcome)? And why does it have the higher query cost relative to the second query if it runs faster?

Best Answer

The first query is using a parallel plan, meaning the "work" was split into multiple tasks carried out by multiple threads. The cumulative CPU time was therefore higher than for the serial plan used for your second query.
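One way to see this for yourself is to compare CPU time against elapsed time for the parallel plan and for a plan forced to run serially. This is a minimal sketch, assuming the dbo.rx table from the question and a test environment; the MAXDOP 1 hint is used here only for comparison, not as a recommended fix.

```sql
-- Report CPU time and elapsed time for each statement (shown in the Messages output).
SET STATISTICS TIME ON;

-- The optimizer is free to choose a parallel plan here, as in the question.
SELECT DISTINCT pat_id, drug_class, drug_name
FROM dbo.rx;

-- Same query, restricted to a single thread for comparison.
SELECT DISTINCT pat_id, drug_class, drug_name
FROM dbo.rx
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
```

With a parallel plan, the reported CPU time can exceed the elapsed time, because CPU time is summed across all the threads that worked on the query.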

As to why the DISTINCT causes the hash match operator to appear in the plan: an aggregate or sort operation is required to determine the DISTINCT result. @SQL_Kiwi might pop up with a more in-depth explanation shortly, but the hash match operator is apparently favoured for larger result sets.

The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s). Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism. Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group). Hash matching does not require or preserve the order of the incoming row stream. [Source]
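To compare the hash-based and sort-based strategies described in the quote, the query can be run once with each aggregation hint. This is a sketch for experimentation only, again assuming the dbo.rx table from the question; the hints are a way to inspect the alternative plans, not something to leave in production code.

```sql
-- Ask for a hash-based aggregate (the Hash Match operator seen in the question).
SELECT DISTINCT pat_id, drug_class, drug_name
FROM dbo.rx
OPTION (HASH GROUP);

-- Ask for an order-based aggregate instead; this typically adds a sort,
-- because the clustered index is not ordered on the DISTINCT columns.
SELECT DISTINCT pat_id, drug_class, drug_name
FROM dbo.rx
OPTION (ORDER GROUP);
```

Comparing the two actual execution plans side by side shows which operators each strategy needs and how the estimated costs differ.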

Related Solutions

Sql-server – Execution Plan Basics — Hash Match Confusion

As SQLRockstar's answer quotes

best for large, unsorted inputs.

Now:

from the Users.DisplayName index scan (assumed nonclustered) you get Users.Id (assumed clustered), which is unsorted
you are also scanning Posts for OwnerUserId, which is also unsorted

That gives 2 unordered inputs.

I'd consider an index on the Posts table on OwnerUserId, including Title. This adds some order to one side of the JOIN input, and it will also be a covering index.

CREATE INDEX IX_OwnerUserId ON Posts (OwnerUserId) INCLUDE (Title)

You may then find that the Users.DisplayName index is no longer used and that the optimizer scans the PK instead.

Sql-server – Interpreting an execution plan

From your question I gather that your table is relatively small. As you put more rows in the table you'll find that the bookmark lookup stays about the same, and the scan takes longer and longer. Eventually the scan will cost many times more than the bookmark lookup.

As SQLMenace said, execution plan costs are often unreliable. Use SQL Server Profiler or SET STATISTICS IO and SET STATISTICS TIME to see what resources each query actually consumes.

Finally, make sure statistics are up to date on your table; otherwise the engine can make poor choices about which indexes or tables to use and in what order.
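The advice above might look like the following in practice. This is only a sketch: dbo.YourTable, col1, and col2 are hypothetical placeholders, since the original question's table is not shown here.

```sql
-- Report logical and physical reads per table for each statement.
SET STATISTICS IO ON;

-- Hypothetical query; substitute the query you are investigating.
SELECT col1, col2
FROM dbo.YourTable
WHERE col1 = 42;

SET STATISTICS IO OFF;

-- Refresh the statistics on the (hypothetical) table so the optimizer
-- works from current row-count and distribution estimates.
UPDATE STATISTICS dbo.YourTable;
```

The logical reads reported for each candidate query give a more direct measure of the work done than the relative cost percentages in the plan.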
