To answer your questions in order:
- The cube doesn't store medians, modes (or even averages), but you can write queries that calculate them and embed them as calculated measures in the cube. The ability to embed this sort of computation is one of the main unique selling points of OLAP technology.
- If you have a dimension that can identify individual rows (which could be a degenerate or 'fact' dimension derived from an identifier on the fact table) then you can do a query based on individual rows. However, OLAP works in terms of dimensions and aggregates, so you would need to have a dimension capable of identifying individual rows (with an aggregate composed of one value).
- Any OLAP tool can do what's described in (2), plus they generally support a mechanism known as 'drill-through' where the cube will return a bordereaux of the transactional data underlying a given slice that you drill through into.
If you want to do calculations that aren't directly available in the cube script, many OLAP tools such as the late, lamented ProClarity will allow you to formulate queries involving custom MDX based calculations. Unless the cube doesn't have the information you need to do the actual calculations, the custom MDX calculations should be able to support any computation you need.
Although OLAP queries are traditionally associated with statistical queries in aggregate, if you have a dimension that allows drill down to detail you need it is certainly possible to formulate queries that will calculate medians, percentiles or histogram queries from which modes can be inferred or computed.
For example, this has an example of a pareto analysis query, which is based on rankings.
Many cube products can operate in a hybrid or relational OLAP mode where they do not persist the data themselves but query it from an underlying database. In addition, pure ROLAP tools such as Business Objects, Report Builder or Discoverer can query from an underlying database and do work row by row. However, they tend to lack the sophistication of dedicated OLAP products, and they don't have much in the way of statistical analysis capability out of the box.
Think about a process or event that you want to analyze.
Let's say you were building Lougle Analytics, and want to analyze visits (as opposed to single-page requests) to your site. Visiting a website is a process.
A fact table represents a process or event you want to analyze, in this case it is a list of site visits. You can have as many fact tables as you want, one per process or event.
What things might be useful to you in analyzing site visits?
- Information about the web browser (brand, screen resolution, ...)
- Information about the user (country, state, city, ISP, based on their IP address)
- Information about when the visit started (year, quarter, month, week, day)
- The duration of the visit
- The referring page, landing page, and exit page (title, url path)
- The number of pages visited during the visit
Technically, you could put this all into one table - as you would in Excel, but that'd get real big real fast, so we'll do ONE level of normalization. Dimensional design is often just "put everything you need for an analysis in one big table", and then normalize out one level.
So your fact table will look like this:
web_browser_key bigint
ip_address_key bigint
start_date_key int
referring_page_key bigint
landing_page_key bigint
exit_page_key bigint
duration_seconds int
number_of_pages_visited int
It has a bunch of key values that refer to values in other tables†, and two non-keys. The non-keys are numeric values and are called Measures. Can you take the average of the web browser brand name? No, so it's a dimension. Can you take the average of the visit duration? Yes, so it's a measure.
The other tables are called Dimension tables, and the ip_address
dimension table might look like this:
ip_address_key bigserial primary key, /* use meaningless surrogate keys */
ip_address inet unique,
country text,
division text,
locality text,
latitude numeric(8,6),
longitude numeric(9,6)
Notice that it is not normalized: country could be derived from city (locality). However in a data warehouse, we care about facilitating analysis first. By facilitating analysis, I mean avoiding deep joins.
Notice that some of the data in the dimension table is hierarchical: country > division > locality. Your OLAP server (hopefully) understands hierarchies to support drill-down operations.
A logical Cube is just the collection of Dimensions and Measures that you are working with. Unlike a Rubik's cube, it can have more than 3 dimensions. Just think of a dimension as a column in your data set.
OLAP is a set of operations you can do against a data set, such a pivoting, slicing, dicing, drilling. Think of Excel PivotTables. An OLAP Server facilitates OLAP operations.
† usually without foreign keys
Best Answer
You need to start off with a
cross join
of detail and payment, because you say "products in any transactions that includes a certain type of payment are considered to be paid with that payment type", so you are effectively asking for parts of a transaction to be counted multiple times if there are multiple payment types used.