Sql-server – Execution Plan Basics — Hash Match Confusion

execution-plansql serversql-server-2008-r2

I am starting to learn execution plans and am confused about how exactly a hash match works and why it would be used in a simple join:

select Posts.Title, Users.DisplayName
From Posts JOIN Users on
Posts.OwnerUserId = Users.Id
OPTION (MAXDOP 1)

enter image description here

As I understand it the results of the Top index scan become the hash able and each row in the bottom Index clustered scan is looked up. I understand how hash tables work to at least some degree, but I am confused about which values exactly get hashed in an example like this.

What would make sense me is the the common field between them, the id, is hashed — but if this is the case, why hash a number?

Best Answer

As SQLRockstar's answer quotes

best for large, unsorted inputs.

Now,

from the Users.DisplayName index scan (assumed nonclustered) you get Users.Id (assuming clustered) = unsorted
You are also scanning Posts for OwnerUserId = unsorted

This is 2 unordered inputs.

I'd consider an index on the Posts table on OwnerUserId, including Title. This will add some order on one side of the input to the JOIN + it will be covering index

CREATE INDEX IX_OwnerUserId ON Posts (OwnerUserId) INCLUDE (Title)

You may then find that the Users.DisplayName index won't be used and it will scan the PK instead.

Related Solutions

Sql-server – Optimizer not choosing index union plan

The optimizer does not always consider index-union plans (like the one shown in your second graphic) to resolve disjunctions (OR predicates) unless a FORCESEEK or INDEX hint is specified. This is a heuristic* based on some practical considerations:

Index union is not often enough a good plan selection for general queries.
The number of ways indexes can be combined grows exponentially.

Using a hint changes the way the optimizer searches the space of possible plans. It disables some of the general heuristics and pursues a more goal-orientated strategy.

The optimizer's usual primary goal is to find a good plan quickly. It does not exhaustively search for the 'best' plan (even relatively simple queries could take years to compile if it did).

Joins with multiple conditions separated with OR have long been problematic. Over the years, the optimizer has added new tricks like converting them to equivalent UNION forms, but the transformations available are limited, so it is quite easy to come unstuck.

As far as the query plan is concerned:

The first row from DispatchLink causes a full scan of the Dispatch table
The result of the scan is stored in an internal tempdb worktable (the Table Spool)
The join checks every row from the worktable against the full OR predicate
The next row is fetched from DispatchLink and the process repeats from step 3

If there are 25,000 rows in the Dispatch Link table, the spool will be fully scanned 25,000 times. This is a disaster of course (and without index intersection, the best the optimizer can do is run the whole thing on multiple threads).

Percentage costs in query plans are only the optimizer's estimates. They never reflect actual execution costs, and are subject to the optimizer's model and will usually bear little resemblance to the 'true' cost of executing the plan on your specific hardware.

Costing numbers are there to be informative, but they should not be taken literally. The particular model the optimizer uses happens to produce pretty good plans for most queries on most systems across the world - that does not mean the model approximates anyone's reality, just that it happens to work reasonably well in practice.

Changing the design so that (Dispatch, Contract) pairs are stored in rows rather than repeated across columns will make the whole index-intersection problem go away. Relational designs with useful constraints and indexes almost always get the best out of the optimizer.

_{* This can be overridden with undocumented trace flag 8726}

Sql-server – How (and why) does TOP impact an execution plan

I would have guessed that when a query includes TOP n the database engine would run the query ignoring the the TOP clause, and then at the end just shrink that result set down to the n number of rows that was requested. The graphical execution plan seems to indicate this is the case -- TOP is the "last" step. But it appears there is more going on.

The way the above is phrased makes me think you may have an incorrect mental picture of how a query executes. An operator in a query plan is not a step (where the full result set of a previous step is evaluated by the next one.

SQL Server uses a pipelined execution model, where each operator exposes methods like Init(), GetRow(), and Close(). As the GetRow() name suggests, an operator produces one row at a time on demand (as required by its parent operator). This is documented in the Books Online Logical and Physical Operators reference, with more detail in my blog post Why Query Plans Run Backwards. This row-at-a-time model is essential in forming a sound intuition for query execution.

My question is, how (and why) does a TOP n clause impact the execution plan of a query?

Some logical operations like TOP, semi joins and the FAST n query hint affect the way the query optimizer costs execution plan alternatives. The basic idea is that one possible plan shape might return the first n rows more quickly than a different plan that was optimized to return all rows.

For example, indexed nested loops join is often the fastest way to return a small number of rows, though hash or merge join with scans might be more efficient on larger sets. The way the query optimizer reasons about these choices is by setting a Row Goal at a particular point in the logical tree of operations.

A row goal modifies the way query plan alternatives are costed. The essence of it is that the optimizer starts by costing each operator as if the full result set were required, sets a row goal at the appropriate point, and then works back down the plan tree estimating the number of rows it expects to need to examine to meet the row goal.

For example, a logical TOP(10) sets a row goal of 10 at a particular point in the logical query tree. The costs of operators leading up to the row goal are modified to estimate how many rows they need to produce to meet the row goal. This calculation can become complex, so it is easier to understand all this with a fully worked example and annotated execution plans. Row goals can affect more than the choice of join type or whether seeks and lookups are preferred to scans. More details on that here.

As always, an execution plan selected on the basis of a row goal is subject to the optimizer's reasoning abilities and the quality of information provided to it. Not every plan with a row goal will produce the required number of rows faster in practice, but according to the costing model it will.

Where a row goal plan proves not to be faster, there are usually ways to modify the query or provide better information to the optimizer such that the naturally selected plan is best. Which option is appropriate in your case depends on the details of course. The row goal feature is generally very effective (though there is a bug to watch out for when used in parallel execution plans).

Your particular query and plan may not be suitable for detailed analysis here (by all means provide an actual execution plan if you wish) but hopefully the ideas outlined here will allow you to make forward progress.

Best Answer

Related Solutions

Sql-server – Optimizer not choosing index union plan

Sql-server – How (and why) does TOP impact an execution plan

Related Question