Sql-server – What cases benefit from the Reduce, Replicate, and Redistribute join hints

azure-sql-data-warehousehintsjoin;sql serversql-server-pdw

The From Clause Documentation starting with SQL Server 2008 briefly mention 3 join hints and their basic mechanisms:

Reduce
Replicate
Redistribute

However there does not seem to be much information on when it might become necessary to use them.

It appears that they can be used in conjunction with the hash, loop and merge which are already understood for purpose of this question.

The relevant section from the documentation:

For SQL Data Warehouse and Parallel Data Warehouse, these join hints apply to INNER joins on two distribution incompatible columns. They can improve query performance by restricting the amount of data movement that occurs during query processing. The allowable join hints for SQL Data Warehouse and Parallel Data Warehouse are as follows:

REDUCE
Reduces the number of rows to be moved for the table on the right side of the join in order to make two distribution incompatible tables compatible. The REDUCE hint is also called a semi-join hint.

REPLICATE
Causes the values in the joining column from the table on the left side of the join to be replicated to all nodes. The table on the right is joined to the replicated version of those columns.

REDISTRIBUTE
Forces two data sources to be distributed on columns specified in the JOIN clause. For a distributed table, Parallel Data Warehouse will perform a shuffle move. For a replicated table, Parallel Data Warehouse will perform a trim move. To understand these move types, see the "DMS Query Plan Operations" section in the "Understanding Query Plans" topic in the Parallel Data Warehouse product documentation. This hint can improve performance when the query plan is using a broadcast move to resolve a distribution incompatible join.

Best Answer

In general terms we use a join hint is to correct a failure of the optimizer to identify a better query plan. This can happen for many reasons which broadly boil down to optimization being a "good enough" approach constrained by a time budget for finding a plan.

Reduce

"also called a semi-join hint". With a semi-join we are only interested in whether or not any matching row exists in the other table. Indeed, it is often because the query has an EXISTS() predicate. If there is one match or one million the result will be the same - the single row from the left (outer) table will be returned.

As an example, we may want to query all Customers who have ever placed an Order. Let's say Customer is a small table but Order is huge and spread over many nodes. It would be wasteful to copy all that data to wherever Customer is held when all we need is to know if at least one row exists for each Customer. The REDUCE hint provides this. Conceptually it is select distinct CustomerId from Order run on each node holding Order, which is relatively fast.

Replicate

This is the converse of REDUCE. Here a table is copied en masse from its "home" node to all nodes holding data from the table on the right-hand side of the join.

For this example we want the total value of all orders, summed by Customer city. We only have a relatively few customers which comfortably fit on a single node. The Orders table, however, is huge and spans many nodes. To satisfy the query we have to get both Customer and Order data onto the same node. It will be more efficient to copy Customer, since it is smaller. The REPLICATE hint forces that behaviour, making a complete copy of the Customer table on each node holding Order rows.

Redistribute

This is another way of stating how rows should be moved from their "home" node to another node so join predicates can be evaluated. This time we are asking the server to copy the data and distribute that copy according to some new distribution function such that the copied rows will end up on the right node to complete the join.

Say now we have a retail e-commerce site with lots of customers and lots of orders. Customers are distributed by CustomerId and Orders distributed by OrderDate. If we now want to join on Customer.CustomerId = Order.CustomerId the most efficient plan may be to copy Order and re-distribute according to CustomerId, then pass each new partition to the corresponding node. This differs from REPLICATE as there the whole table is copied to every node whereas here a portion of each node's data is copied over to other nodes. The specific way the re-distribution is handled will depend in the current distribution and the cardinalities.

Related Solutions

Sql-server – Combine Table Hints INDEX and FORCESEEK with Two Joins Not On PK

I would also test this rewriting (aasuming that sID is the primary key of docSVsys):

SELECT COUNT(*)              
FROM [docSVsys] AS d
WHERE EXISTS
      ( SELECT * 
        FROM [docSVtext] AS sv
        WHERE sv.[sID] = d.[sID] 
          AND sv.[value] = 'doug'
      )
   OR EXISTS
      ( SELECT * 
        FROM [docMVtext] AS mv
        WHERE mv.[sID] = d.[sID] 
          AND mv.[value] = 'doug'
      ) ;

Sql-server – Updating a big replicated Dimension (SQL Server PDW)

A few questions. 20 million rows isn't necessarily that large.

What process are you using to perform your updates and deletes right now?

Is the dimension a CLUSTERED COLUMNSTORE INDEX, CLUSTERED INDEX or HEAP?

Are you saying that there is movement whilst you update and delete this table or did you just see movement when you changed the table from replicated to distributed?

If it is the latter that is not surprising. You are unlikely to be join and aggregation compatible. If you are doing something to trigger the movement through your update / delete then we could look at that - although a concrete example would be helpful.

In general terms I would start by trying to keep the ETL simple.

Use CTAS against the dimension selecting only the rows you want to keep, union in any new rows and use CASE to pick up any changes (converting the UPDATE into a transform within the CTAS). Once complete you can then use a pair of RENAME OBJECT commands to switch from the current table to the new table. This gives you the added benefit of having a historic view of your table - which you can drop at your leisure.