Sql-server – Header and line item data source mismatch

data-warehousefacttablesql server

I'm working with a star schema for a data warehouse and I am running into a problem with header and line items from different data sources.

CREATE TABLE DataSourceAHeader
(
     OrderId INT NOT NULL
    ,TotalCost MONEY NOT NULL
    -- Date, etc...
);

CREATE TABLE DataSourceALine
(
     OrderId INT NOT NULL
    ,LineNumber INT NOT NULL
    -- Dates, etc...
);

CREATE TABLE DataSourceBLine
(
     OrderId INT NOT NULL
    ,Cost MONEY NOT NULL
    ,LineNumber INT NOT NULL
);

I have data sources A and B which represent the same data in different ways. Data source A contains headers and line items, but it only has the net outcome (Total Cost) in the header. Data source B contains only line items and each item has an outcome (Cost).

I could keep two fact tables (one for the header and one for the line items), but I have researched and it seems inadvisable. Is there a strategy to deal with this kind of mismatched format or should they be stored in separate data warehouses (one warehouse per data source)?

My current strategy:

CREATE TABLE Fact.Order
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderId INT NOT NULL
    ,Cost MONEY NOT NULL
    -- Date key, etc...
);

CREATE TABLE Fact.OrderLine
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderFactId BIGINT NOT NULL REFERENCES Fact.Order (Id)
    ,LineNumber INT NOT NULL
    -- related line stuff
);

DataSourceAHeader and DataSourceBLine are inserted into Order and OrderLine. DataSourceBLine is split one line per row.

Here is an example for a DataSourceAHeader and DataSourceALine

SELECT * FROM Fact.Order;
|------------------------------------|
|   Id   |   OrderId   |   Cost      |
|   1    |     1100    |   12000.00  |
|   2    |     1101    |   10000.00  |
|------------------------------------|

SELECT * FROM Fact.OrderLine;
|-------------------------------------------|
|   Id   |   OrderFactId   |   LineNumber   |
|   1    |        1        |       1        |
|   2    |        1        |       2        |
|   3    |        1        |       3        |
|   4    |        2        |       1        |
|   5    |        2        |       2        |
|   6    |        2        |       3        |
|-------------------------------------------|

Here is an example for a DataSourceBLine

SELECT * FROM Fact.Order;
|---------------------------------|
|   Id   |   OrderId   |   Cost   |
|   1    |     1000    |   12.00  |
|   2    |     1000    |   10.00  |
|---------------------------------|

SELECT * FROM Fact.OrderLine;
|-------------------------------------------|
|   Id   |   OrderFactId   |   LineNumber   |
|   1    |        1        |       1        |
|   2    |        2        |       2        |
|-------------------------------------------|

Edit:

the TotalCost in the header cannot be brought down to the line level. I chatted with an architect acquaintance and his advice was to implement two separate fact tables, one for header (summary) and one for the lines (detail), and just have NULL values for the missing line information for DataSourceA.

Edit2:

I'm trying to be generic with the OrderId since I have several more data sources that may contain similar OrderId schemes (collisions). I have implemented a Mapping table in order to translate the source identifiers into the warehouse.

Edit3:

With the intention that this question be helpful to more than just myself, I would like the answer to have the following details
(mostly to compile what everyone has already reasoned about):

  • In general what are the approaches to resolving related disjoint data sets taking the form of summary/detail (single fact table or summary/detail fact tables)?
  • What are the drawbacks to each approach?
  • What kind of structure could the fact table take to cope with missing (or irrelevant) data?
  • (two fact table approach) In what cases would it be prudent to roll down the summary versus rolling up the details?

Best Answer

If you want to de-normalize this into a single fact table, the fact table is going to be about line items. Therefore, the facts from DataSourceAHeader need to be split up and distributed to the relevant line items so they are not duplicated. As it is currently presented, that means dropping your total order cost and calculating this by summing the line item costs.

The DataSourceAHeader dimension keys (e.g. order date) can be taken from DataSourceAHeader and applied to the fact rows generated from DataSourceBLine. In the example there doesn't seem to be any info contained on DataSourceALine which isn't already included on either DataSourceAHeader or DataSourceBLine, but if there is this can be mapped across in a similar way.

This approach relies on a number of assumptions, the key one being that all the facts from DataSourceAHeader can accurately be distributed among its constituent line items. If this isn't true, loading two separate fact tables (one for the order and one for the line items) might well be a better approach. The same might be true if there are a lot of questions to be asked about orders, which do not consider line item specific info. This is labelled as "Bad Idea #2" in the article which you've referenced, but I have found that in certain circumstances, it's actually a good idea.

Finally this assumes that the two data sources are in sync. If they're not, you'll be limiting yourself to loading data at the pace of the slower data source. This might be fine, but needs to be considered in the context of your needs and the difference between the two data sources.

Edit: De-normalizing into a single fact table may significantly impact performance when counting orders, as it's essentially a distinct count, which would be my main reason for considering two separate fact tables.

Edit 2 (in response to question edit):

Here, the issue is that at the most granular level (line) data is incomplete, in as much as not all rows have a cost value. However, the total cost information is available at the next level up (header). This presents the situation where you cannot derive the higher level from the lower; let’s consider the resulting options:

  1. Have a single fact table at the lowest granularity available (line). This is a non-starter, as we are now relying on the incomplete line data to answer questions at the higher level, which we know we could have answered.
  2. Have a single fact table at the higher granularity (header). This means we can now answer questions at the higher level with the complete data, but can no longer answer questions at the more granular level at all. This may be considered to be acceptable, but in most cases we are throwing away potentially valuable data.
  3. Have two related fact tables, one for the incomplete, more granular data (line) and one for the complete, less granular data (header). This is the ideal solution, as we can now answer questions at the higher level in full, and can give the best possible answer to questions at the lower level, given the incompleteness of the source data.

This question was raised because of doubts about having two related fact tables. The doubts stem from the fact that maintaining and joining two large fact tables can be resource intensive. That's true, and if your most granular information can be used to provide a full description of the situation then using a single fact table is preferable. However, in situations like this where that's not possible, two fact tables are required if you want to preserve as much information as possible.