Choosing sort and dist keys for Redshift with multiple joins


I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary table. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join on and sort on the columns used in the WHERE clause. However, in this situation there is no WHERE clause, and the intermediary table is joined twice on two different columns. For example, I have three tables:

users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:

users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.

What is the optimal way to set the sort and dist keys on the accounts table in this situation?

Best Answer

Ideally, you would add a user_id column to subscriptions, so that all three tables can be distributed on the same value (users on id, accounts and subscriptions on user_id). Then your joins would be written as:

FROM subscriptions
LEFT JOIN accounts
  ON  accounts.id = subscriptions.account_id
  AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
  ON  users.id = accounts.user_id

The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
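
One way to check for skew after loading is to look at the skew_rows column of the SVV_TABLE_INFO system view, which reports the ratio between the slice with the most rows and the slice with the fewest. This is only a sketch: the table names are the ones from the question, and what counts as "significant" skew is a judgment call for your workload.

```sql
-- Sketch: compare row skew across the three tables from the question.
-- skew_rows close to 1 means rows are spread evenly across slices.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" IN ('users', 'accounts', 'subscriptions')
ORDER BY skew_rows DESC;
```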

As a bonus, if you do not frequently filter these tables on other columns, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins.
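
As a rough illustration of that layout, the table definitions might look like the following. The column types and the absence of any other columns are assumptions for the sake of the example; only the table names, id, user_id and account_id come from the question and the query above.

```sql
-- Sketch only: column types are assumed, other columns are omitted.
-- All three tables are distributed and sorted on the same user value,
-- so matching rows live on the same slice and arrive in join order.
CREATE TABLE users (
    id BIGINT NOT NULL
)
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (id);

CREATE TABLE accounts (
    id      BIGINT NOT NULL,
    user_id BIGINT NOT NULL
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (user_id);

CREATE TABLE subscriptions (
    id         BIGINT NOT NULL,
    account_id BIGINT NOT NULL,
    user_id    BIGINT NOT NULL  -- the added column
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (user_id);
```

Whether the planner actually picks a merge join also depends on the tables being kept well sorted, so it is worth confirming with EXPLAIN rather than assuming it.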

If you can't add this column, then rows will have to be redistributed across the network for one of your two joins :-( Pick which join gets co-located based on how frequently that join is run and the average width of the rows participating in it.
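
To see which join is paying for redistribution, you can inspect the query plan. The sketch below uses the two joins from the question (with the account_id spelling from the query above, which is an assumption about the real column name) and selects COUNT(*) only to avoid guessing at other columns.

```sql
-- Sketch: look at the distribution label on each join step in the plan.
-- DS_DIST_NONE means the join is co-located (no data movement);
-- labels such as DS_BCAST_INNER or DS_DIST_BOTH mean rows are being
-- broadcast or redistributed across the network for that join.
EXPLAIN
SELECT COUNT(*)
FROM subscriptions
LEFT JOIN accounts
  ON  accounts.id = subscriptions.account_id
LEFT JOIN users
  ON  users.id = accounts.user_id;
```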

Related Solutions

Managing code and deployments to Amazon Redshift

In general there is very little tool support at present for Redshift. We do all of our automation and deployment via psql shell scripts and we manage the code base using git version control.

As for schema changes, we use a "migrations" approach (inspired by Rails) where we create ALTER TABLE … ADD COLUMN and ALTER TABLE … DROP COLUMN scripts to roll the database forward/back from one deployment to the next.
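
A migration pair in this style might look like the sketch below. The table name, column name and the idea of keeping "up" and "down" statements in separate files are all hypothetical; the answer does not say how the scripts are organized.

```sql
-- Hypothetical "up" script: roll the schema forward.
ALTER TABLE my_table ADD COLUMN new_col VARCHAR(64);

-- Hypothetical "down" script: roll the schema back.
-- Note that Redshift will not drop a column that is the table's
-- distribution key or part of its sort key.
ALTER TABLE my_table DROP COLUMN new_col;
```

Each script would then be run with psql against the chosen database, for example with the -d option to pick the database and -f to pass the script file.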

We also use 3 separate databases on the same cluster for Dev/Test/Prod. We then specify the target database in the psql script using the -d option.

Hope that helps.

Redshift table not showing up in tables for schema

PG_TABLE_DEF in Redshift only returns information about tables that are visible to the user. In other words, it will only show you the tables in the schema(s) listed in the search_path variable. If PG_TABLE_DEF does not return the expected results, verify that the search_path parameter is set correctly to include the relevant schema(s).

Try this -

mydb=# set search_path="$user",db;

Then run your query -

mydb=# select tablename from pg_table_def where schemaname = 'db';
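
If it helps to see what the session is currently using before changing it, a slightly fuller sketch is shown here. The schema name db is the one from the example above; keeping public on the path is an assumption about what you want to remain visible.

```sql
-- Sketch: inspect the current setting, then extend it for this session.
SHOW search_path;

SET search_path TO '$user', public, db;

-- PG_TABLE_DEF returns one row per column, so DISTINCT collapses
-- the result to one row per table.
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'db';
```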
