Sql-server – What cases benefit from the Reduce, Replicate, and Redistribute join hints

azure-sql-data-warehousehintsjoin;sql serversql-server-pdw

The From Clause Documentation starting with SQL Server 2008 briefly mention 3 join hints and their basic mechanisms:

  • Reduce
  • Replicate
  • Redistribute

However there does not seem to be much information on when it might become necessary to use them.

It appears that they can be used in conjunction with the hash, loop and merge which are already understood for purpose of this question.

The relevant section from the documentation:

For SQL Data Warehouse and Parallel Data Warehouse, these join hints apply to INNER joins on two distribution incompatible columns. They can improve query performance by restricting the amount of data movement that occurs during query processing. The allowable join hints for SQL Data Warehouse and Parallel Data Warehouse are as follows:

  • REDUCE
    Reduces the number of rows to be moved for the table on the right side of the join in order to make two distribution incompatible tables compatible. The REDUCE hint is also called a semi-join hint.

  • REPLICATE
    Causes the values in the joining column from the table on the left side of the join to be replicated to all nodes. The table on the right is joined to the replicated version of those columns.

  • REDISTRIBUTE
    Forces two data sources to be distributed on columns specified in the JOIN clause. For a distributed table, Parallel Data Warehouse will perform a shuffle move. For a replicated table, Parallel Data Warehouse will perform a trim move. To understand these move types, see the "DMS Query Plan Operations" section in the "Understanding Query Plans" topic in the Parallel Data Warehouse product documentation. This hint can improve performance when the query plan is using a broadcast move to resolve a distribution incompatible join.

Best Answer

In general terms we use a join hint is to correct a failure of the optimizer to identify a better query plan. This can happen for many reasons which broadly boil down to optimization being a "good enough" approach constrained by a time budget for finding a plan.

Reduce

"also called a semi-join hint". With a semi-join we are only interested in whether or not any matching row exists in the other table. Indeed, it is often because the query has an EXISTS() predicate. If there is one match or one million the result will be the same - the single row from the left (outer) table will be returned.

As an example, we may want to query all Customers who have ever placed an Order. Let's say Customer is a small table but Order is huge and spread over many nodes. It would be wasteful to copy all that data to wherever Customer is held when all we need is to know if at least one row exists for each Customer. The REDUCE hint provides this. Conceptually it is select distinct CustomerId from Order run on each node holding Order, which is relatively fast.

Replicate

This is the converse of REDUCE. Here a table is copied en masse from its "home" node to all nodes holding data from the table on the right-hand side of the join.

For this example we want the total value of all orders, summed by Customer city. We only have a relatively few customers which comfortably fit on a single node. The Orders table, however, is huge and spans many nodes. To satisfy the query we have to get both Customer and Order data onto the same node. It will be more efficient to copy Customer, since it is smaller. The REPLICATE hint forces that behaviour, making a complete copy of the Customer table on each node holding Order rows.

Redistribute

This is another way of stating how rows should be moved from their "home" node to another node so join predicates can be evaluated. This time we are asking the server to copy the data and distribute that copy according to some new distribution function such that the copied rows will end up on the right node to complete the join.

Say now we have a retail e-commerce site with lots of customers and lots of orders. Customers are distributed by CustomerId and Orders distributed by OrderDate. If we now want to join on Customer.CustomerId = Order.CustomerId the most efficient plan may be to copy Order and re-distribute according to CustomerId, then pass each new partition to the corresponding node. This differs from REPLICATE as there the whole table is copied to every node whereas here a portion of each node's data is copied over to other nodes. The specific way the re-distribution is handled will depend in the current distribution and the cardinalities.