Have a separate database for data warehousing

business-intelligencedata-warehouse

I have built a typical web application using PHP and MySQL for enterprise usage.

The MySQL database is 3NF as much as possible.

However, as time goes by, the requests for the data appear to be more for a data warehouse so that they can slice and dice the data in whatever way they want.

Hence I picked up a book the 3rd Edition Data Warehouse Toolkit of the Kimball Group.

Inside the authors described a Kimball DW/BI Architecture and using star schema for dimensional and fact tables for DW/BI purposes. See below.

enter image description here

My question is, since I am most familiar with PHP and MySQL, should I have the data warehouse star schema in a separate database from the original 3NF database?

No commercial BI tool will be used. Anything that looks like an OLAP or BI would likely be an open source software that I may augment myself.

Please advise.

EDIT

My entire 3NF MySQL database is currently about 20mb so far growing about 1 mb every month or so.

Best Answer

In response to your reply to my comment, I mentioned some of the areas in which PostgreSQL shines for DW work - notably CTEs (Common Table Expressions - AKA the WITH clause) and Window functions (AKA the OVER clause).

Without these in MySQL, you'll end up writing your own hacks to emulate these (and, no disrespect, but it'll take a while for your code to be bug free), whereas with PostgreSQL you get them out of the box. PostgreSQL also has superior GIS support (should you require it) and better JSON - only experimental in MySQL.

Take a look here

PostgreSQL has traditionally focused on reliability, data-integrity and integrated developer-focused features. It has an extremely sophisticated query planner, which is capable of joining relatively large numbers of tables efficiently.

MySQL, on the other hand, has traditionally focused on read-mostly web apps, usually written in PHP, where the principal concern is with optimising simple queries.

DW type apps need far more of the former than the latter.

Also, check this out - a very recent comparion of the two systems. My take on this is that PostgreSQL is the better of the two, esp. for DW work.

You could peruse these pages for PostgreSQL and MySQL features. There is a certain "religious war" element here, and I'm mindful of the fact that you already have MySQL experience which may be a clincher for you. I wish you all the best with your DW project.

[EDIT in response to OP's comment]

I would certainly urge you to have separate databases for your OLTP and your OLAP work - if that is within your budget. OLTP and OLAP are fundamentally different and if you mix them you'll have many conflicts - I know the pain caused by "managers" insisting on running reports against a live system during the day - so if I were you, I would have MySQL as my OLTP system and PostgreSQL as my OLAP.

But whatever configuration you choose, I strongly suggest separate servers.

If I were forced to choose one database and one only, it would be PostgreSQL without hesitation. MySQL was essentially lucky and in the right place at the right time with the right software during the first internet boom. IMHO, PostgreSQL is undoubtedly technically superior.

I bear MySQL no ill-will - I have used it a lot (clients!) and find that if you can live with or workaround its quirks, it can be quite performant and has some interesting features - notably the storage engine choices one can make (which one can't with PostgreSQL).

In the DW sphere for MySQL there are two interesting developments which may bode well for it in this area in the future - these are columnar store engines - Infinidb and Infobright. The problem at the moment is that Infinidb appears to have died - even though the code is still available, the installation guide has disappeared and ICE (Infobright Community Edition) is essentially crippleware. You may wish to evaluate the Entreprise edition.

I urge you to set up a few test cases - imagine what sort of reports you will be running and try them on both systems. Think especially about the complex reports required for OLAP systems - I believe that you will find the far richer PostgreSQL SQL language to be a persuasive argument in its favour.

Related Question