How to handle “many columns” in OLAP RDBMS

data-warehouse database-design olap star-schema

I have a fact table with around 1K different numerical attributes (i.e. columns). I would like to store this in a column-oriented DB and perform cube analysis on it.

I tried to design a star schema, but I'm not sure how to handle this many columns. Normalising them sounds wrong, but I can't just have flat columns either. The combinations of attributes are also too diverse for a simple dimension table, even if I reduced the numerical values to categories (ranges), which is an option. I thought about storing them as XML or JSON for each row, but that doesn't sound great either.

If it helps, I'm planning to use Amazon Redshift for the DB.

Note: we have a strong preference for Redshift, as it fits at least a few other operations we perform on this data perfectly. Hence I want to avoid other technologies such as HBase if possible.

Best Answer

Regarding why you would want to do this: imagine you want to see which words or short phrases in customer emails are associated with costly repairs, and you want to be able to analyse this using OLAP. It can be costly to tokenize/grammify many documents, so you might want to store the tokens/grams in a form which your OLAP server understands, i.e. columns.
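As a minimal sketch of that preprocessing step (plain Python, with made-up example emails): tokenize each document into unigram counts, then pivot the combined vocabulary into one column per token, which is the flat columnar shape an OLAP store expects. A real pipeline would use a proper tokenizer or n-grammer instead of `str.split`.

```python
from collections import Counter

def tokenize(text):
    """Naive lowercase unigram tokenizer; swap in a real tokenizer/n-grammer."""
    return text.lower().split()

def to_columns(documents):
    """Turn a list of documents into (column_names, rows of token counts)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocab = sorted(set(tok for c in counts for tok in c))  # one column per token
    rows = [[c.get(tok, 0) for tok in vocab] for c in counts]
    return vocab, rows

# Hypothetical customer emails
emails = ["gearbox noise after service", "gearbox replaced after costly repair"]
columns, rows = to_columns(emails)
```

Each entry of `columns` would become a numeric column in the fact table, with `rows` as its values.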

Consider MonetDB, which allows for a practically unlimited number of columns.

Redshift maxes out at 1600 columns.

Another option would be to use Principal Component Analysis and keep only the top 1600 components, but that can make interpretation difficult.
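To illustrate the idea, here is a rough sketch of extracting the leading principal component in pure Python, using power iteration on the covariance matrix (a real pipeline would use a linear-algebra library and keep the top k components, not just one):

```python
def top_component(rows, iters=200):
    """Leading principal component of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    # Centre each column on its mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(row[i] * row[j] for row in centred) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalise
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points lying on the line y = 2x: the top component is (1, 2) / sqrt(5)
pc = top_component([[0, 0], [1, 2], [2, 4], [3, 6]])
```

Projecting each fact row onto the kept components gives the reduced column set, at the cost that each component is a blend of the original attributes, hence the interpretation problem.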

Another option would be to use Postgres and store the tokenized strings or n-grams in a string array field, but your OLAP server would need to support that.
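For illustration (plain Python, with a hypothetical `emails(id, grams)` table), a tokenized document can be stored in a Postgres `text[]` column. In practice you would pass the Python list as a bound query parameter, since drivers such as psycopg2 adapt lists to Postgres arrays; the sketch below just shows the array literal format itself:

```python
def pg_text_array(tokens):
    """Format a list of strings as a Postgres text[] literal.
    In real code, pass the list as a bound parameter instead and let the
    driver do the adaptation; this only illustrates the literal syntax."""
    quoted = ['"%s"' % t.replace('\\', '\\\\').replace('"', '\\"') for t in tokens]
    return "{%s}" % ",".join(quoted)

# Hypothetical table: emails(id int, grams text[])
tokens = ["gearbox", "costly repair"]
sql = "INSERT INTO emails (id, grams) VALUES (%s, '%s')" % (1, pg_text_array(tokens))
```

Whether this helps depends on the OLAP layer: it must be able to unnest or filter on the array column for the tokens to be usable in analysis.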