As noted in the comments, there are valid arguments for both models. Let's call them "star" (the flattened schema your DBA suggests) and "EAV" (entity-attribute-value). The latter name can serve as a hint in itself. Details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of a well-known type and you don't invent new ones / drop old ones all the time, it's not the worst-case EAV scenario, but it is similar.
There is no "standard" way to do it. The more flexible approach would be the "EAV" schema. Then you can easily add and delete metric-types: add a row to the referenced metric_type
table, or delete one (cascading to metrics table). That would need a schema-change and be more expensive in the "star" model.
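A minimal sketch of what such an EAV layout could look like (table and column names are assumptions for illustration, not taken from your schema):

CREATE TABLE metric_type (
    metric_type_id serial PRIMARY KEY
  , name           text NOT NULL UNIQUE
);

CREATE TABLE metrics (
    ts             timestamptz NOT NULL
  , metric_type_id int NOT NULL REFERENCES metric_type
                       ON DELETE CASCADE  -- deleting a type removes its rows
  , value          double precision NOT NULL
  , PRIMARY KEY (ts, metric_type_id)
);

Adding a metric type is then a plain INSERT into metric_type; dropping one is a DELETE that cascades.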
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
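For example, hedged sketches against the hypothetical metrics table above (INCLUDE requires PostgreSQL 11+):

-- Multicolumn index serving range queries per metric type:
CREATE INDEX metrics_type_ts_idx ON metrics (metric_type_id, ts);

-- Covering, partial index for one hot metric type (the id 7 is made up):
CREATE INDEX metrics_hot_idx ON metrics (ts) INCLUDE (value)
WHERE metric_type_id = 7;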
Some decision guidance
Aggregates (min/max/avg/...) on a single metric-type? -> "star"
Aggregates considering all or many different metrics? -> "EAV"
Do these attributes describe a common entity? -> "star"
Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
Your data is written once and then never changed? -> "star"
Or do you run UPDATES on selected metrics? -> "EAV"
Is your set of 500 metrics complete? The set hardly ever or never changes? -> "star"
New metrics are added, existing metrics are dropped all the time? -> "EAV"
Storage size
Concerning your comment:
Storage is less important now for optimization, we are focusing on query times.
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". int occupies 4 bytes, double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) occupy 2036 bytes in the "star" schema. No compression possible. Here is how you calculate that:
Configuring PostgreSQL for read performance
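As a back-of-the-envelope check (the per-row overhead figures are standard PostgreSQL heap numbers, not measured on your table):

SELECT 24        -- heap tuple header (23 bytes, padded to 24)
     + 4         -- item identifier in the page header
     + 8         -- timestamp
     + 500 * 4   -- 500 NOT NULL integer columns
       AS star_row_bytes;  -- = 2036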
If you mix int with double, be sure not to waste space for padding: double needs 8-byte alignment. For instance, group integer metrics together and double metrics together.
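A sketch of the idea (column names invented; the unfortunate ordering wastes 4 bytes of padding before each double that follows a lone int):

-- Wasteful: each double after a lone int forces 4 bytes of padding
CREATE TABLE star_bad  (ts timestamptz, m1 int, m2 double precision,
                        m3 int, m4 double precision);

-- Better: doubles first (8-byte aligned), then ints packed back to back
CREATE TABLE star_good (ts timestamptz, m2 double precision,
                        m4 double precision, m1 int, m3 int);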
In the "EAV" model, you'd need at least 44 or 52 bytes per row. 22000 or 26000 bytes for one timestamp. 11 - 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
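A minimal sketch of how crosstab() could pivot the hypothetical EAV table from above into one row per timestamp (the two-metric result list is an assumption for illustration):

-- requires: CREATE EXTENSION tablefunc;
SELECT *
FROM   crosstab(
          'SELECT ts, metric_type_id, value
           FROM   metrics
           ORDER  BY 1, 2'
       ) AS ct (ts timestamptz, metric_1 double precision, metric_2 double precision);
-- note: the 2-parameter form of crosstab() is safer when values can be missing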
This works - I had the JOIN wrong:
UPDATE `events`
INNER JOIN `plays_in_events` ON `plays_in_events`.`EventID` = `events`.`EventID`
INNER JOIN `plays` ON `plays_in_events`.`PlayID` = `plays`.`PlayID`
SET    `events`.`Name` = `plays`.`Play`
WHERE  `events`.`Name` IS NULL
  AND  `plays_in_events`.`PlayID` = 1;
and the earlier comments were quite right, but it took me a while to 'get it' - thanks for your help. I had just copied the JOIN lines from the working SELECT query, but when I looked again, I could finally see what the commenters meant about events appearing twice and plays_in_events not appearing properly in the JOINs.
Best Answer
In the ETL process, you can replace the codes with names while loading a target table. I'll focus on Informatica PowerCenter, but I'm sure other ETL tools offer a similar feature.
There is a Lookup transformation that is used to look up values (DNAME) from a relational table (it may also be a view or a flat file), based on a defined lookup match criterion (source.DEPTNO = lookup.DEPTNO). These values may then be appended to the source rows and stored in a target table (that is used for reporting).

When a session is executed, a SELECT statement is generated for each lookup in the mapping. These statements are run against the data source once and the results are stored in the lookup cache. Later, when a value needs to be looked up, the transformation uses the cache instead of querying the source again.
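Expressed as plain SQL, the effect is roughly the following (EMP as the source, DEPT as the lookup table, and the target table name are assumptions inferred from the DEPTNO/DNAME columns above):

-- What PowerCenter generates per lookup (conceptually): cache the lookup table once
SELECT deptno, dname FROM dept;

-- Net effect of source + lookup + target load, written as one statement:
INSERT INTO target_emp_report (empno, ename, deptno, dname)
SELECT e.empno, e.ename, e.deptno, d.dname   -- dname appended to the source rows
FROM   emp e
LEFT   JOIN dept d ON d.deptno = e.deptno;   -- the lookup match criterion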