Consistent data in dimension and fact tables from multiple incrementally loaded staging tables

data-warehousessis

To create a data model for our data warehouse we use tooling supplied by the ERP vendor. This does probably matter due to the fact that is has it's limitations. We inherited this environment with a certain design. We were new to data warehousing and performing this as only a part of our job, so we had a steap learning curve. 🙂 Our basic design for our data warehouse is like this:

[source] -> [staging table] -> [Persistent Staging Area table] -> [set of views] -> [dimension/fact table]

staging table: has only 1 source table, truncated before load, only the delta of records since yesterday is loaded
Persistent Staging Area table: never truncated, loaded with delta records of staging table. So the result is that records are never deleted, current records are updated based on the natural key.

All dimensions and fact tables are truncated and re-loaded every night. This is possible due to the persistent staging area. No history is required in the dimension or fact tables currently. This is probably designed like this in the past, because you are able to completely rebuild all the dimension and fact tables if you like. It makes changes a little more easy to implement, since you do not have to backup the data every time, etc.

We are re-thinking our data warehouse design, since we have learned a lot in the past years. 🙂 We have ETL performance issues, so we want to look at incrementally loading the dimension and fact tables, but are struggling with the following issue.

Let's say we cut out the Persistent Staging Area layer, so we only have the staging tables which are loaded with delta records only. We have a view C that combines data from source table A and B. This view C is the source for dimension table D and fact table F. (this is a very simplified example)

Now, a column value for a record in table A changes. This column value is an attribute in the dimension table D. Since the view C is based on 2 staging tables, which are incrementally loaded, we will see this records in view C depending on the join type. Let's say it is a left outer join. We only see NULL values for table B fields, together with this changed column value. This will enter the dimension table D as NULL values for fields of table B and the field value from table A. This is unwanted of course, since it makes the data inconsistent. At this moment, this problem is solved with using a Persistent Staging Area. With the Persistent Staging Area, the record in there will be updated and propagate correctly to our dimension, since that is reloaded every night. I hope I have explained it clearly.

So we want to look at cutting out the Persistent Staging Area layer, but are not sure how to cope with changes like this. So the scenario where we load only changes to our staging tables and truncate these prior to re-load (to load the new changes). Iam not sure how you would normally solve this. There is probably always some sort of temporary staging required for this between the staging table and the dimension or fact table? Or am I missing something here?

So my question is not about the delta load of the staging tables, I know about CDC, or that trunacting and reloading our dimension and fact tables is bad practice, but Iam probably missing something crucial in how you would bring your data from your staging tables (with delta records only) to your dimension/fact tables (which are combined from many source tables) and only 1 source record changes in a consistent manner. There should be some intermediate staging to make things consistent right?

Update following questions bbaird:

  1. No, we want to see if changing the structure will better support our needs and improve performance. We think that incrementally loading dimensions and facts, as well as removing the PSA will improve performance. Keeping historic records will be done in the dimension and fact tables and not in a PSA anymore.

  2. I have tried to illustrate the current and required future situation. Transforming is done by creating views and sometimes maybe a intermediate staging table. This is how the tooling works and we are used building transformations at this moment. We want to investigate other possibilities to replace the current data warehouse tooling. The picture of the future situation illustrates what will happen if for example a new records i found in table B, but not in Table C. Since that natural key is missing in table A we are going to miss this record or get NULL values, which would make the dimension contain inconsistent data. I would think that such a model would not be a correct model from a functional perspective by the way.

So I think am missing something here. Iam not sure how this is build using other tooling than ours. I do not have that experience yet. In my opinion, to make such a model contain consistent data, you would need some form of persistent staging. My guess is just that such models as I describe above, are just not correct from a functional perspective, but Iam not sure.

Current situation

Future situation

Edit 2:

I have added a data example and also changed the view's join type to a left outer join. This will show exactly the behaviour Iam trying to illustrate. I hope it is clear. I did not save some changes, so I partially had to rebuild the drawing and it might look a little weird because of that.

Day 1 data example

Day 2 data example

Edit 3: The difference between the current and future situation is that the new records will be loaded in the current situation with the data from Table B. In the new situation, the Table B data would NOT be loaded in the dimension. Is my thinking wrong or is there something wrong with the design? I have added a data example of the current situation. You now have an example how the different situations would behave when inserting a new record on day 1 and day 2. On day 2 you can see that in the current situation the data for the new record in Table B would be loaded into the dimension. In the new situation the data for Table B would not be loaded into the dimension, because it is a left outer join, incremental staging table load with truncate every night AND there is no PSA. I hope this makes things clear.

enter image description here

enter image description here

Best Answer

Ok, in the Table A/Table B scenario you outline, we have three possibilities:

  1. Table A and Table B update/insert at the same time - existing logic works
  2. Table A XOR Table B updates - can locate record based on ColA, update only the records for the table received.
  3. Table A XOR Table B inserts - hold the record in staging until the other record arrives. Flag if it sits longer than x minutes/hours/days.

Looks like #3 is what you're trying to resolve, and should be solved if you hold the record in staging until its match arrives. There could be instances where new records for Table A and Table B arrive at the same time, if you have timestamps this won't be a problem to resolve.

Keeping an audit table will assist resolving any discrepancies that might be caused by scenario 2.