I have been researching Amazon's Redshift database as a possible future replacement for our data warehouse. My experience has always been in using dimensional modeling and Ralph Kimball's methods, so it was a little weird to see that Redshift doesn't support features such as the serial data type for auto incrementing columns.
There is, however, this recent blog post from the AWS Big Data blog about how to optimize Redshift for a star schema: https://blogs.aws.amazon.com/bigdata/post/Tx1WZP38ERPGK5K/Optimizing-for-Star-Schemas-and-Interleaved-Sorting-on-Amazon-Redshift
The question I have is about what is the best practice for loading a star schema in Redshift? I cannot find this answered in any of Redshift's documentation.
I'm leaning toward importing my files from S3 into staging tables and then using SQL to do the transformations such as lookups and generating surrogate keys before inserting into the destination tables.
Is this what others are currently doing? Is there an ETL tool worth the money to make this easier?
Best Answer
You are definately on the right track with Kimball rather than inmon for Redshift.
There are a number of patterns for this, I have used them all in different use cases
Note that Redshift sometimes works BETTER if you have a wide table with repeated values rather than a fact and dimensions. The reason for this is that the columnar approach lets Redshift compress the different values down to a level that is pretty efficient. I do not have a formula for when to use many Dimensions vs a flat wide table, the only way is to try it and see!
Some links
AWS DMS for Redshift taret
AWS Glue