As commented, arguments from both sides are valid. Let's call the two designs "star" (the flattened schema of your DBA) and "EAV" (entity-attribute-value). The name of the latter can serve as a hint. Details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of well-known types and you don't invent new ones or drop old ones all the time, it's not the worst-case EAV scenario, but it's similar.
There is no "standard" way to do it. The more flexible approach would be the "EAV" schema. Then you can easily add and delete metric-types: add a row to the referenced metric_type
table, or delete one (cascading to metrics table). That would need a schema-change and be more expensive in the "star" model.
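To make the two layouts concrete, here is a minimal sketch; the table and column names (metrics_star, metric_type, metrics_eav) are hypothetical stand-ins, not from the question:

```sql
-- "star": one wide row per timestamp, one column per metric.
CREATE TABLE metrics_star (
  ts         timestamp PRIMARY KEY
, metric_001 int              NOT NULL
, metric_002 double precision NOT NULL
  -- ... one column for each of the 500 metrics
);

-- "EAV": one narrow row per (timestamp, metric).
CREATE TABLE metric_type (
  metric_type_id serial PRIMARY KEY
, name           text NOT NULL UNIQUE
);

CREATE TABLE metrics_eav (
  ts             timestamp NOT NULL
, metric_type_id int NOT NULL REFERENCES metric_type ON DELETE CASCADE
, value          double precision NOT NULL
, PRIMARY KEY (ts, metric_type_id)
);
```

Adding a 501st metric is a single INSERT into metric_type in the "EAV" model, but a schema change (ALTER TABLE ... ADD COLUMN) in the "star" model.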
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
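A few hedged examples against the hypothetical tables above (INCLUDE needs PostgreSQL 11 or later):

```sql
-- Multicolumn index: one metric over a time range in the EAV model.
CREATE INDEX ON metrics_eav (metric_type_id, ts);

-- Covering variant (Postgres 11+): such queries can be answered
-- from the index alone, without heap access.
CREATE INDEX ON metrics_eav (metric_type_id, ts) INCLUDE (value);

-- Partial index for one hot metric.
CREATE INDEX ON metrics_eav (ts) WHERE metric_type_id = 42;
```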
Some decision guidance
- Aggregates (min/max/avg/...) on a single metric type? -> "star" (see the query sketches after this list)
- Aggregates considering all or many different metrics? -> "EAV"
- Do these attributes describe a common entity? -> "star"
- Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
- Is your data written once and then never changed? -> "star"
- Or do you run UPDATEs on selected metrics? -> "EAV"
- Is your set of 500 metrics complete, and does it hardly ever change? -> "star"
- Are new metrics added and existing metrics dropped all the time? -> "EAV"
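To illustrate the first two points, hedged sketches of both aggregate shapes against the hypothetical tables from above:

```sql
-- Aggregates on a single metric type: natural in the "star" schema,
-- reading only one column per row.
SELECT min(metric_001), max(metric_001), avg(metric_001)
FROM   metrics_star
WHERE  ts >= '2024-01-01' AND ts < '2024-01-02';

-- Aggregates over many metric types at once: natural in the "EAV"
-- schema, a single pass with GROUP BY.
SELECT metric_type_id, min(value), max(value), avg(value)
FROM   metrics_eav
WHERE  ts >= '2024-01-01' AND ts < '2024-01-02'
GROUP  BY metric_type_id;
```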
Storage size
Concerning your comment:
"Storage is less important now for optimization, we are focusing on query times."
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". An int occupies 4 bytes; a double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) would occupy 2036 bytes in the "star" schema. No compression possible. Here is how you calculate that:
Configuring PostgreSQL for read performance
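Following the method in that answer, the arithmetic looks like this; the cross-check uses pg_column_size() on the hypothetical table from above:

```sql
--   24 bytes  heap tuple header (23 bytes, padded to the next multiple of 8)
-- +  4 bytes  item identifier in the page header
-- +  8 bytes  timestamp column
-- + 2000 bytes  500 NOT NULL int columns x 4 bytes
-- = 2036 bytes per row

-- Cross-check on live data. The row is measured as a record datum,
-- so the number can differ slightly from the on-disk figure:
SELECT pg_column_size(t.*) AS row_bytes
FROM   metrics_star t
LIMIT  1;
```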
If you mix int with double, be sure not to waste space on alignment padding. For instance, group integer metrics and double metrics together.
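A hypothetical illustration of the padding effect: double precision requires 8-byte alignment, so each double that follows a lone int costs 4 wasted bytes.

```sql
-- Alternating types: 4 bytes of padding before each double.
CREATE TABLE metrics_padded (
  ts timestamp          -- 8 bytes
, m1 int                -- 4 bytes + 4 bytes padding
, m2 double precision   -- 8 bytes
, m3 int                -- 4 bytes + 4 bytes padding
, m4 double precision   -- 8 bytes
);

-- Grouped types: no padding between the metric columns.
CREATE TABLE metrics_grouped (
  ts timestamp
, m2 double precision
, m4 double precision
, m1 int
, m3 int
);
```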
In the "EAV" model, you'd need at least 44 or 52 bytes per row. 22000 or 26000 bytes for one timestamp. 11 - 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
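For a quick first look, the standard admin functions (applied here to the hypothetical metrics_eav table):

```sql
-- Total on-disk size of a table, including indexes and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('metrics_eav'));

-- Heap only, without indexes:
SELECT pg_size_pretty(pg_relation_size('metrics_eav'));
```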
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
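A minimal sketch, assuming you want to pivot three metric types of the hypothetical metrics_eav table into one row per timestamp; the category list in the second query must match the output column list:

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM   crosstab(
          $$SELECT ts, metric_type_id, value
            FROM   metrics_eav
            WHERE  metric_type_id IN (1, 2, 3)
            ORDER  BY 1, 2$$
        , $$VALUES (1), (2), (3)$$   -- expected categories, in order
       ) AS ct (ts timestamp
              , m1 double precision
              , m2 double precision
              , m3 double precision);
```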
Great question, and I did a session about this at TechEd a few years ago called Building the Fastest SQL Servers:
https://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/DBI328
In it, I explain that for data warehouses, you need storage that can provide data fast enough for SQL Server to consume it. Microsoft built a great series of white papers called the Fast Track Data Warehouse Reference Architecture that goes into hardware details, but the basic idea is that your storage needs to be able to provide 200-300 MB/sec of sequential read performance, per CPU core, in order to keep the CPUs busy.
The more of your data that you can cache in memory, the slower storage you can get away with. But you've got less memory than required to cache the fact tables that you're dealing with, so storage speed becomes very important.
Here are your next steps:
- Watch that video
- Test your storage with CrystalDiskMark (Here's how)
- With 4 cores, you'll want at least 800 MB/sec of sequential read throughput
- If you don't have that, consider adding memory until the pain goes away (and caching the entire database in RAM isn't unthinkable)
Say you've got a 200GB database that you're dealing with, and you can't get enough storage throughput to keep your cores busy. It's not unthinkable to need not just 200GB of RAM, but even more - because after all, SSIS and SSAS really want to do their work in memory, so you have to have the engine's data available, plus work space for SSIS and SSAS.
This is also why people try to separate out SSIS and SSAS onto different VMs - they all need memory simultaneously.
Databases designed with the assumption that they will be entirely resident in main memory can use structures such as T-tree indexes. But the real advantage is that in-memory databases (IMDBs) are just simpler. They do less (they don't have to worry about managing a cache, serializing writes for consistency, or anything to do with ACID-compliant I/O at all), so they execute fewer instructions on the hardware to carry out the same "work". A general-purpose database has to be all things to all people, like a Leatherman with a dozen tools; but sometimes you just need a cutting edge, so you buy a blade from Cold Steel, and no one debates that it is the better knife!