Medians, Modes, Percentiles and OLAP

olapssasstatistics

I'm newbie trying to wrap my head around OLAP, and I have a few questions.

  • Question 1: Can an OLAP cube store medians, modes, percentiles?
  • Question 2: Can an user-written MDX query return a summary of row-level data? (ex: % transactions > $100). Or must the cube designer add this to the cube?
  • Question 3: Do any OLAP products now provide mechanisms for accessing row level data? Which?

Our IT Department are looking for feedback on what sort of issues we're having with a particular MS Analsis Services ROLAP cube. We don't have access to the relational database behind it and need to perform calculations that aren't currently available as measures in the cube.

Let me see if I have this right.

  1. A cube can provide statistics for counts, means, proportions, standard deviations.
  2. If a particular statistic hasn't been catered for in a measure provided by the cube designer, can we write an MDX query to get it? Or do they need to change the cube in order to pre-calculate it from the row level data?
  3. A cube cannot provide statistics likes medians, modes or percentiles, beacuse these statistics don't aggregrate properly.

I'm reading Leland Wilkinson's The Grammar of Graphics and in his chapter on Data Mining and OLAP, he says

These [cube operations] work well with statistics like counts, means,
proportions, and standard deviations. Simple aggregations over
subclasses can be computed by operating on sums, sums of squares, and
other terms that are combined in linear functions to produce basic
summary statistics.

They do not work properly with statistics like the median, mode and
percentiles because the aggregate of these statistics is not the
statistic of their aggregates. The median of medians is not the median
of the aggregate, for example.

He goes on to add:

A more sophisticated ROLAP model has emerged recently, however. It is
possible, through several technologies, to give statistical algorithms
access to raw data through the relational model in real time.
This approach is more promising than the rigid aggregations offered by
structures such as data cubes.

In the most elegant form of this
architecture, applications can request remote connections to provide
information about their data-handling methods and take suitable action
depending on the returned information. In this form, component
architecture can achieve the real promise of distributed computing:
design and execution that are independent of site, operating system,
or language.

That was written circa 2005. Is anyone aware of products employing this methodology to allow for row-level data access?

Best Answer

To answer your questions in order:

  1. The cube doesn't store medians, modes (or even averages), but you can write queries that calculate them and embed them as calculated measures in the cube. The ability to embed this sort of computation is one of the main unique selling points of OLAP technology.
  2. If you have a dimension that can identify individual rows (which could be a degenerate or 'fact' dimension derived from an identifier on the fact table) then you can do a query based on individual rows. However, OLAP works in terms of dimensions and aggregates, so you would need to have a dimension capable of identifying individual rows (with an aggregate composed of one value).
  3. Any OLAP tool can do what's described in (2), plus they generally support a mechanism known as 'drill-through' where the cube will return a bordereaux of the transactional data underlying a given slice that you drill through into.

If you want to do calculations that aren't directly available in the cube script, many OLAP tools such as the late, lamented ProClarity will allow you to formulate queries involving custom MDX based calculations. Unless the cube doesn't have the information you need to do the actual calculations, the custom MDX calculations should be able to support any computation you need.

Although OLAP queries are traditionally associated with statistical queries in aggregate, if you have a dimension that allows drill down to detail you need it is certainly possible to formulate queries that will calculate medians, percentiles or histogram queries from which modes can be inferred or computed.

For example, this has an example of a pareto analysis query, which is based on rankings.

Many cube products can operate in a hybrid or relational OLAP mode where they do not persist the data themselves but query it from an underlying database. In addition, pure ROLAP tools such as Business Objects, Report Builder or Discoverer can query from an underlying database and do work row by row. However, they tend to lack the sophistication of dedicated OLAP products, and they don't have much in the way of statistical analysis capability out of the box.