As noted in the comments, there are valid arguments for both models. Let's call them "star" (the flattened schema your DBA suggests) and "EAV" (entity-attribute-value). The latter name can serve as a hint in itself. Details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of a well-known type and you don't invent new ones / drop old ones all the time, it's not the worst-case EAV scenario, but it is similar.
There is no "standard" way to do it. The more flexible approach would be the "EAV" schema. Then you can easily add and delete metric-types: add a row to the referenced metric_type
table, or delete one (cascading to metrics table). That would need a schema-change and be more expensive in the "star" model.
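A minimal sketch of what such an EAV layout could look like (table and column names are assumptions for illustration, not taken from your schema):

CREATE TABLE metric_type (
    metric_type_id serial PRIMARY KEY
  , name           text NOT NULL UNIQUE
);

CREATE TABLE metrics (
    ts             timestamptz NOT NULL
  , metric_type_id int NOT NULL REFERENCES metric_type
                       ON DELETE CASCADE  -- deleting a type removes its rows
  , value          double precision NOT NULL
  , PRIMARY KEY (ts, metric_type_id)
);

Adding a metric type is then a plain INSERT into metric_type; dropping one is a DELETE that cascades.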
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
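For example, hedged sketches against the hypothetical metrics table above (INCLUDE requires PostgreSQL 11+):

-- Multicolumn index serving range queries per metric type:
CREATE INDEX metrics_type_ts_idx ON metrics (metric_type_id, ts);

-- Covering, partial index for one hot metric type (the id 7 is made up):
CREATE INDEX metrics_hot_idx ON metrics (ts) INCLUDE (value)
WHERE metric_type_id = 7;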
Some decision guidance
Aggregates (min/max/avg/...) on a single metric-type? -> "star"
Aggregates considering all or many different metrics? -> "EAV"
Do these attributes describe a common entity? -> "star"
Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
Your data is written once and then never changed? -> "star"
Or do you run UPDATES on selected metrics? -> "EAV"
Is your set of 500 metrics complete? The set hardly ever or never changes? -> "star"
New metrics are added, existing metrics are dropped all the time? -> "EAV"
Storage size
Concerning your comment:
Storage is less important now for optimization, we are focusing on query times.
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". int occupies 4 bytes, double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) occupy 2036 bytes in the "star" schema. No compression possible. Here is how you calculate that:
Configuring PostgreSQL for read performance
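As a back-of-the-envelope check (the per-row overhead figures are standard PostgreSQL heap numbers, not measured on your table):

SELECT 24        -- heap tuple header (23 bytes, padded to 24)
     + 4         -- item identifier in the page header
     + 8         -- timestamp
     + 500 * 4   -- 500 NOT NULL integer columns
       AS star_row_bytes;  -- = 2036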
If you mix int with double, be sure not to waste space for padding: double needs 8-byte alignment. For instance, group integer metrics together and double metrics together.
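A sketch of the idea (column names invented; the unfortunate ordering wastes 4 bytes of padding before each double that follows a lone int):

-- Wasteful: each double after a lone int forces 4 bytes of padding
CREATE TABLE star_bad  (ts timestamptz, m1 int, m2 double precision,
                        m3 int, m4 double precision);

-- Better: doubles first (8-byte aligned), then ints packed back to back
CREATE TABLE star_good (ts timestamptz, m2 double precision,
                        m4 double precision, m1 int, m3 int);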
In the "EAV" model, you'd need at least 44 or 52 bytes per row. 22000 or 26000 bytes for one timestamp. 11 - 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
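A minimal sketch of how crosstab() could pivot the hypothetical EAV table from above into one row per timestamp (the two-metric result list is an assumption for illustration):

-- requires: CREATE EXTENSION tablefunc;
SELECT *
FROM   crosstab(
          'SELECT ts, metric_type_id, value
           FROM   metrics
           ORDER  BY 1, 2'
       ) AS ct (ts timestamptz, metric_1 double precision, metric_2 double precision);
-- note: the 2-parameter form of crosstab() is safer when values can be missing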
This works - I had the JOIN wrong:
UPDATE `events`
INNER JOIN `plays_in_events` ON `plays_in_events`.`EventID` = `events`.`EventID`
INNER JOIN `plays` ON `plays_in_events`.`PlayID` = `plays`.`PlayID`
SET    `events`.`Name` = `plays`.`Play`
WHERE  `events`.`Name` IS NULL
  AND  `plays_in_events`.`PlayID` = 1;
and the earlier comments were quite right, but it took me a while to 'get it' - thanks for your help. I had just copied the JOIN lines from the working SELECT query, but when I looked again, I could finally see what the commenters meant about events appearing twice and plays_in_events not appearing properly in the JOINs.
Best Answer
In the ETL process, you can replace the codes with names while loading a target table. I'll focus on Informatica PowerCenter, but I'm sure other ETL tools offer a similar feature.
There is a Lookup transformation that is used to look up values (DNAME) from a relational table (it may also be a view or a flat file), based on a defined lookup match criterion (source.DEPTNO = lookup.DEPTNO). These values may then be appended to the source rows and stored in a target table (that is used for reporting).

When a session is executed, a SELECT statement is generated for each lookup in the mapping. These statements are run against the data source once and the results are stored in the lookup cache. Later, when a value needs to be looked up, the transformation uses the cache instead of querying the source again.
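Expressed as plain SQL, the effect is roughly the following (EMP as the source, DEPT as the lookup table, and the target table name are assumptions inferred from the DEPTNO/DNAME columns above):

-- What PowerCenter generates per lookup (conceptually): cache the lookup table once
SELECT deptno, dname FROM dept;

-- Net effect of source + lookup + target load, written as one statement:
INSERT INTO target_emp_report (empno, ename, deptno, dname)
SELECT e.empno, e.ename, e.deptno, d.dname   -- dname appended to the source rows
FROM   emp e
LEFT   JOIN dept d ON d.deptno = e.deptno;   -- the lookup match criterion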