Dimensional Modeling and ETL in Redshift

Tags: dimensional-modeling, etl, redshift

I have been researching Amazon Redshift as a possible future replacement for our data warehouse. My experience has always been with dimensional modeling and Ralph Kimball's methods, so it was a little weird to see that Redshift doesn't support features such as the serial data type for auto-incrementing columns.

There is, however, this recent blog post from the AWS Big Data blog about how to optimize Redshift for a star schema: https://blogs.aws.amazon.com/bigdata/post/Tx1WZP38ERPGK5K/Optimizing-for-Star-Schemas-and-Interleaved-Sorting-on-Amazon-Redshift

My question is: what is the best practice for loading a star schema in Redshift? I can't find this addressed anywhere in the Redshift documentation.

I'm leaning toward importing my files from S3 into staging tables and then using SQL to do the transformations such as lookups and generating surrogate keys before inserting into the destination tables.
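Concretely, something like the following is what I have in mind. This is only a rough sketch; stg_sales, dim_customer, and fct_sales are made-up names, and I'm using an IDENTITY column since Redshift lacks SERIAL:

```sql
-- Surrogate key via IDENTITY, since Redshift has no SERIAL type.
CREATE TABLE dim_customer (
    customer_key  BIGINT IDENTITY(1,1),
    customer_id   VARCHAR(32) NOT NULL,   -- natural/business key
    customer_name VARCHAR(256)
);

-- Insert customers present in staging but missing from the dimension.
INSERT INTO dim_customer (customer_id, customer_name)
SELECT DISTINCT s.customer_id, s.customer_name
FROM stg_sales s
LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;

-- Populate the fact table, resolving surrogate keys with lookup joins.
INSERT INTO fct_sales (customer_key, sale_date, amount)
SELECT d.customer_key, s.sale_date, s.amount
FROM stg_sales s
JOIN dim_customer d ON d.customer_id = s.customer_id;
```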

Is this what others are currently doing? Is there an ETL tool worth the money to make this easier?

Best Answer

You are definitely on the right track with Kimball rather than Inmon for Redshift.

There are a number of patterns for this; I have used them all in different use cases:

  1. "ELT" pattern - Load the source tables to redshift fully, do not do any significant transformations until the data has been loaded. For this you can either load to s3, then use redshift copy command or I would recommend using "AWS data migration services", which can sync a source (e.g.mysql or postgres) to a target (e.g. redshift) Then, on a regular basis run sql processes within redshift to populate dims then facts. You can use third part cloud based tools to "simplify" this process if you want to - such as Matillion (i do not recommend using a third party tool)
  2. "ETL pattern" - Transform the data in flight, using apache spark. and load the dims and facts into redshift spark->s3->redshift. I have used EMR for this which is good. this is also the approach taken if you use AWS Glue
  3. Do not transform ! - similar to 1) but just use the tables that have been loaded.
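For the load step in pattern 1, a minimal COPY sketch looks like the following; the bucket path, IAM role ARN, and stg_sales staging table are all placeholders:

```sql
-- Bulk-load gzipped CSV files from S3 into a staging table.
-- Bucket, role ARN, and table name are hypothetical.
COPY stg_sales
FROM 's3://my-bucket/exports/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS CSV
GZIP
TIMEFORMAT 'auto';
```

COPY reads files in parallel across the cluster's slices, so splitting a large export into multiple files generally loads faster than a single big file.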

Note that Redshift sometimes works BETTER if you have one wide table with repeated values rather than a fact and dimensions. The reason is that columnar storage lets Redshift compress those repeated values down to a level that is pretty efficient. I do not have a formula for when to use many dimensions versus a flat wide table; the only way is to try it and see!
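If you want to test the flat-table route, a quick way to see how well the repeated values compress is ANALYZE COMPRESSION, which samples the loaded data and suggests an encoding per column. A sketch, with a hypothetical sales_flat table:

```sql
-- Denormalized "one big table": dimension attributes repeated on every row.
CREATE TABLE sales_flat (
    sale_date     DATE,
    customer_name VARCHAR(256),   -- low-cardinality repeats compress well
    product_name  VARCHAR(256),
    amount        DECIMAL(12,2)
);

-- After loading data, ask Redshift to recommend per-column encodings.
ANALYZE COMPRESSION sales_flat;
```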

Some links

AWS DMS for Redshift target

AWS Glue