Choosing sort and distkeys for redshift with multiple joins

redshift

I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:

users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:

users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.

What is the optimal way to set the sort and dist keys on the accounts table in this situation?

Best Answer

The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:

FROM subscriptions
LEFT JOIN accounts
  ON  accounts.id = subscriptions.account_id
  AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
  ON  users.id=accounts.user_id

The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.

As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins

If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.