As commented, arguments from both sides are valid. Let's call the two designs "star" (the flattened schema of your DBA) and "EAV" (entity-attribute-value). The name of the latter can serve as a hint. Details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of well-known types and you don't invent new ones or drop old ones all the time, it's not the worst-case EAV scenario, but it's similar.
There is no "standard" way to do it. The more flexible approach would be the "EAV" schema. Then you can easily add and delete metric-types: add a row to the referenced metric_type
table, or delete one (cascading to metrics table). That would need a schema-change and be more expensive in the "star" model.
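To make the two layouts concrete, here is a minimal sketch; the table and column names (metrics_star, metric_type, metrics_eav) are hypothetical stand-ins, not from the question:

```sql
-- "star": one wide row per timestamp, one column per metric.
CREATE TABLE metrics_star (
  ts         timestamp PRIMARY KEY
, metric_001 int              NOT NULL
, metric_002 double precision NOT NULL
  -- ... one column for each of the 500 metrics
);

-- "EAV": one narrow row per (timestamp, metric).
CREATE TABLE metric_type (
  metric_type_id serial PRIMARY KEY
, name           text NOT NULL UNIQUE
);

CREATE TABLE metrics_eav (
  ts             timestamp NOT NULL
, metric_type_id int NOT NULL REFERENCES metric_type ON DELETE CASCADE
, value          double precision NOT NULL
, PRIMARY KEY (ts, metric_type_id)
);
```

Adding a 501st metric is a single INSERT into metric_type in the "EAV" model, but a schema change (ALTER TABLE ... ADD COLUMN) in the "star" model.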
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
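A few hedged examples against the hypothetical tables above (INCLUDE needs PostgreSQL 11 or later):

```sql
-- Multicolumn index: one metric over a time range in the EAV model.
CREATE INDEX ON metrics_eav (metric_type_id, ts);

-- Covering variant (Postgres 11+): such queries can be answered
-- from the index alone, without heap access.
CREATE INDEX ON metrics_eav (metric_type_id, ts) INCLUDE (value);

-- Partial index for one hot metric.
CREATE INDEX ON metrics_eav (ts) WHERE metric_type_id = 42;
```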
Some decision guidance
- Aggregates (min/max/avg/...) on a single metric type? -> "star" (see the query sketches after this list)
- Aggregates considering all or many different metrics? -> "EAV"
- Do these attributes describe a common entity? -> "star"
- Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
- Is your data written once and then never changed? -> "star"
- Or do you run UPDATEs on selected metrics? -> "EAV"
- Is your set of 500 metrics complete, and does it hardly ever change? -> "star"
- Are new metrics added and existing metrics dropped all the time? -> "EAV"
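To illustrate the first two points, hedged sketches of both aggregate shapes against the hypothetical tables from above:

```sql
-- Aggregates on a single metric type: natural in the "star" schema,
-- reading only one column per row.
SELECT min(metric_001), max(metric_001), avg(metric_001)
FROM   metrics_star
WHERE  ts >= '2024-01-01' AND ts < '2024-01-02';

-- Aggregates over many metric types at once: natural in the "EAV"
-- schema, a single pass with GROUP BY.
SELECT metric_type_id, min(value), max(value), avg(value)
FROM   metrics_eav
WHERE  ts >= '2024-01-01' AND ts < '2024-01-02'
GROUP  BY metric_type_id;
```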
Storage size
Concerning your comment:
"Storage is less important now for optimization, we are focusing on query times."
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". An int occupies 4 bytes; a double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) would occupy 2036 bytes in the "star" schema. No compression possible. Here is how you calculate that:
Configuring PostgreSQL for read performance
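Following the method in that answer, the arithmetic looks like this; the cross-check uses pg_column_size() on the hypothetical table from above:

```sql
--   24 bytes  heap tuple header (23 bytes, padded to the next multiple of 8)
-- +  4 bytes  item identifier in the page header
-- +  8 bytes  timestamp column
-- + 2000 bytes  500 NOT NULL int columns x 4 bytes
-- = 2036 bytes per row

-- Cross-check on live data. The row is measured as a record datum,
-- so the number can differ slightly from the on-disk figure:
SELECT pg_column_size(t.*) AS row_bytes
FROM   metrics_star t
LIMIT  1;
```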
If you mix int with double, be sure not to waste space on alignment padding. For instance, group integer metrics and double metrics together.
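A hypothetical illustration of the padding effect: double precision requires 8-byte alignment, so each double that follows a lone int costs 4 wasted bytes.

```sql
-- Alternating types: 4 bytes of padding before each double.
CREATE TABLE metrics_padded (
  ts timestamp          -- 8 bytes
, m1 int                -- 4 bytes + 4 bytes padding
, m2 double precision   -- 8 bytes
, m3 int                -- 4 bytes + 4 bytes padding
, m4 double precision   -- 8 bytes
);

-- Grouped types: no padding between the metric columns.
CREATE TABLE metrics_grouped (
  ts timestamp
, m2 double precision
, m4 double precision
, m1 int
, m3 int
);
```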
In the "EAV" model, you'd need at least 44 or 52 bytes per row. 22000 or 26000 bytes for one timestamp. 11 - 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
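For a quick first look, the standard admin functions (applied here to the hypothetical metrics_eav table):

```sql
-- Total on-disk size of a table, including indexes and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('metrics_eav'));

-- Heap only, without indexes:
SELECT pg_size_pretty(pg_relation_size('metrics_eav'));
```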
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
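A minimal sketch, assuming you want to pivot three metric types of the hypothetical metrics_eav table into one row per timestamp; the category list in the second query must match the output column list:

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM   crosstab(
          $$SELECT ts, metric_type_id, value
            FROM   metrics_eav
            WHERE  metric_type_id IN (1, 2, 3)
            ORDER  BY 1, 2$$
        , $$VALUES (1), (2), (3)$$   -- expected categories, in order
       ) AS ct (ts timestamp
              , m1 double precision
              , m2 double precision
              , m3 double precision);
```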
Great question, and I did a session about this at TechEd a few years ago called Building the Fastest SQL Servers:
https://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/DBI328
In it, I explain that for data warehouses, you need storage that can provide data fast enough for SQL Server to consume it. Microsoft built a great series of white papers called the Fast Track Data Warehouse Reference Architecture that goes into hardware details, but the basic idea is that your storage needs to be able to provide 200-300 MB/sec of sequential read performance, per CPU core, in order to keep the CPUs busy.
The more of your data that you can cache in memory, the slower storage you can get away with. But you've got less memory than required to cache the fact tables that you're dealing with, so storage speed becomes very important.
Here are your next steps:
- Watch that video
- Test your storage with CrystalDiskMark (Here's how)
- With 4 cores, you'll want at least 800 MB/sec of sequential read throughput
- If you don't have that, consider adding memory until the pain goes away (and caching the entire database in RAM isn't unthinkable)
Say you've got a 200GB database that you're dealing with, and you can't get enough storage throughput to keep your cores busy. It's not unthinkable to need not just 200GB of RAM, but even more - because after all, SSIS and SSAS really want to do their work in memory, so you have to have the engine's data available, plus work space for SSIS and SSAS.
This is also why people try to separate out SSIS and SSAS onto different VMs - they all need memory simultaneously.
Databases designed with the assumption that they will be entirely resident in main memory can use structures such as T-tree indexes. But the real advantage is that in-memory databases (IMDBs) are just simpler. They do less (they don't have to worry about managing a cache, serializing writes for consistency, or anything to do with ACID-compliant I/O at all), so they execute fewer instructions on the hardware to carry out the same "work". A general-purpose database has to be all things to all people, like a Leatherman with a dozen tools; but sometimes you just need a cutting edge, so you buy a blade from Cold Steel, and no one debates that it is the better knife!