I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:
users
-> accounts
-> subscriptions
. These are in a 1:1:1 relationship. The joins are as follows:
users.id
= accounts.user_id
accounts.id
= subscriptions.accounts_id
.
What is the optimal way to set the sort and dist keys on the accounts
table in this situation?
Best Answer
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.