Data Warehouse – Using for Non-BI Data: Best Practices

Tags: data-warehouse, etl

The primary goal is a use-case-optimized data representation. The idea is to gather data from several databases, save it in a data warehouse, and restructure/connect it in data marts (similar to a view across multiple databases). Basically, the intention is to make it easier to access data that is spread across several databases.
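To make the idea concrete, here is a rough sketch of that gathering step, assuming SQLite-style sources; the file and table names are just placeholders:

    import sqlite3

    # Rough sketch with made-up file and table names: copy every user table
    # from each source database into a staging table in the warehouse.
    SOURCE_DBS = {"sales": "sales.db", "crm": "crm.db"}
    WAREHOUSE_DB = "warehouse.db"

    def load_sources_into_warehouse():
        wh = sqlite3.connect(WAREHOUSE_DB)
        for src_name, src_path in SOURCE_DBS.items():
            src = sqlite3.connect(src_path)
            tables = [r[0] for r in src.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )]
            for table in tables:
                cur = src.execute(f"SELECT * FROM {table}")
                cols = [d[0] for d in cur.description]
                staging = f"stg_{src_name}_{table}"
                wh.execute(f"DROP TABLE IF EXISTS {staging}")
                wh.execute(f"CREATE TABLE {staging} ({', '.join(cols)})")
                wh.executemany(
                    f"INSERT INTO {staging} VALUES ({', '.join('?' for _ in cols)})",
                    cur.fetchall(),
                )
            src.close()
        wh.commit()
        wh.close()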

Data Marts

As you can see in the picture, the plan is to have dependent data marts (shown in the upper half of the picture) to avoid having multiple Clean and Load processes, as independent data marts do (shown in the bottom half). The BI aspect of the data warehouse will also be used, but that is not relevant for my question. Only the data in the source DBs is changed; the data represented in the DW is for reading purposes only, so the basic concept is not violated. But I still feel like the DW is being misused here.
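To illustrate the dependent-mart idea: once a single Clean and Load step has produced cleaned tables in the warehouse, every mart is just another query over those same tables. A rough sketch with made-up table names:

    import sqlite3

    WAREHOUSE_DB = "warehouse.db"

    # Each mart is a different query over the same cleaned warehouse tables,
    # so the Clean and Load logic exists only once, upstream.
    MART_DEFINITIONS = {
        "mart_sales_by_region": """
            SELECT c.region, SUM(o.amount) AS total_amount
            FROM clean_orders o
            JOIN clean_customers c ON c.customer_id = o.customer_id
            GROUP BY c.region
        """,
        "mart_orders_per_customer": """
            SELECT c.customer_id, c.name, COUNT(*) AS order_count
            FROM clean_orders o
            JOIN clean_customers c ON c.customer_id = o.customer_id
            GROUP BY c.customer_id, c.name
        """,
    }

    def rebuild_marts():
        wh = sqlite3.connect(WAREHOUSE_DB)
        for mart, query in MART_DEFINITIONS.items():
            wh.execute(f"DROP TABLE IF EXISTS {mart}")
            wh.execute(f"CREATE TABLE {mart} AS {query}")
        wh.commit()
        wh.close()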

To answer my question, please address the following points:

  • Is the DW misused?
  • Are there any DWs with data connectors that offer (near) real-time
    capabilities?
  • What alternatives fit these requirements: connecting data from
    multiple DBs and representing it in a clean, structured way in
    real time, without having tons of Clean and Load processes?

Best Answer

The first (top) example is essentially the most common way to design the ETL process for a BI datawarehouse and its datamarts. You don't want the lower example, because you would need to build, deploy/version, and troubleshoot multiple instances of the same code.

As for "misuse" or not, it all comes down to how you design the datamart. Typically, star schemas are designed for analytics, but there's nothing to say it won't work for OLTP-type work. I would consider the amount of work required to remodel the data when loading - is it worth the extra complexity, or would you be better off querying the source database directly from your app? Maybe a database view in the source database(s) could do the same, with a lot less complexity?

Be really careful with "real-time" BI implementations: The more "real-time" you build the solution, the more complex it inevitably gets.

On the one hand, a process that does a truncate-and-load every night is relatively simple to build and run.
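A minimal sketch of such a nightly truncate-and-load, with hypothetical table names:

    import sqlite3

    def nightly_truncate_and_load(source_path="sales.db", warehouse_path="warehouse.db"):
        src = sqlite3.connect(source_path)
        wh = sqlite3.connect(warehouse_path)
        rows = src.execute(
            "SELECT order_id, customer_id, order_date, amount FROM orders"
        ).fetchall()
        wh.execute("DELETE FROM fact_orders")  # "truncate" the fact table
        wh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
        wh.commit()
        src.close()
        wh.close()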

Incremental loads that run every hour, on the other hand, require you to identify new, changed, and perhaps even deleted (!) rows in the source data. This may be hard or even impossible, depending on how the source data is structured or shipped to you.
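A sketch of what an incremental load might look like, assuming the source table carries an updated_at column, the warehouse keeps a watermark in an etl_state table, and fact_orders has a primary key on order_id; none of that is a given in practice:

    import sqlite3

    def hourly_incremental_load(source_path="sales.db", warehouse_path="warehouse.db"):
        src = sqlite3.connect(source_path)
        wh = sqlite3.connect(warehouse_path)

        # Read the watermark left behind by the previous run.
        (watermark,) = wh.execute(
            "SELECT last_loaded_at FROM etl_state WHERE table_name = 'orders'"
        ).fetchone()

        changed = src.execute(
            "SELECT order_id, customer_id, order_date, amount, updated_at "
            "FROM orders WHERE updated_at > ?", (watermark,)
        ).fetchall()

        # Upsert changed rows; note that hard-deleted source rows are invisible
        # here, which is exactly the problem described above.
        wh.executemany(
            "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?, ?)", changed
        )
        if changed:
            new_watermark = max(r[4] for r in changed)
            wh.execute(
                "UPDATE etl_state SET last_loaded_at = ? WHERE table_name = 'orders'",
                (new_watermark,),
            )
        wh.commit()
        src.close()
        wh.close()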

A full-on "real-time" solution (within quotes, because unless you have just views on the source database, it's not really realtime) will probably need some kind of change capture, trigger or similar complex logic for every dependent table. All of those things can potentially affect the source databases adversely with regards to performance and stability.

It gets even more complex and costly when those tables interact, for example if you need to join snowflake schemas into a star schema.
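For example, flattening a snowflaked product dimension into a single star-schema dimension is one more join (and one more thing to keep correct) in the load, sketched here with made-up names:

    import sqlite3

    wh = sqlite3.connect("warehouse.db")
    wh.executescript("""
        DROP TABLE IF EXISTS dim_product_flat;
        CREATE TABLE dim_product_flat AS
        SELECT p.product_id, p.product_name, c.category_name, d.department_name
        FROM stg_products p
        JOIN stg_categories c ON c.category_id = p.category_id
        JOIN stg_departments d ON d.department_id = c.department_id;
    """)
    wh.commit()
    wh.close()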

To answer your third question: pretty much any ETL tool (SSIS, Pentaho, etc.) will support basic datawarehouse ETL. Once you add near-real-time logic, the choice of ETL tool becomes less relevant (you'll probably need to script most of it yourself anyway), and the solution becomes heavily dependent on the database platforms of the source data as well as on the infrastructure and platform of your datawarehouse.