How to accomplish local joins between facts and dimensions

greenplum

A fact table typically contains a composite primary key made up of the keys to all the dimensions that define its granularity. if we distribute the table by this key, it cannot possibly accomplish a local join to any of the dimensions to which it is bound because those tables are distributed by their own PKs, which won't match the hash generated for the fact.

I'm guessing that if dimensions are small it won't matter much, but if the dimension is large e.g. a security master table with millions of CUSIPs and other identifiers, how can local joins be accomplished?

what is the standard approach to selecting a distribution key for facts?

Best Answer

Your distribution approach by the composite primary key will create a distribution with low or no skew. That's great but, as you point out, it comes with some costs in other areas. In practice even distribution is often a secondary concern. Distribution tends to be by the largest commonly used join or the most common aggregation. In the case you describe this sounds to be the security descriptor dimension.

Data distribution in a MPP database aims to optimise two metrics: minimising the movement of data and making the use of the all the available hardware performance in the cluster. Minimising data motion is achieved by co-locating data for large join and aggregate operations. Using all the hardware to get the best performance for a query requires that the data is evenly distributed and that typical queries will not be working on data on a single node. So when picking a distribution key to minimise data motion distribution it's not only important to minimise skew but also to avoid distributing by a field which is common used in query predicates.