As noted in the comments, there are valid arguments on both sides. Let's call the two designs "star" (the flattened schema of your DBA) and "EAV" (entity-attribute-value). The latter name is a hint in itself; details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of well-known types and you don't invent new ones or drop old ones all the time, it's not the worst-case EAV scenario, but it is similar.
There is no "standard" way to do it. The more flexible approach is the "EAV" schema: you can easily add and delete metric types by adding a row to the referenced metric_type table, or deleting one (cascading to the metrics table). The same operation in the "star" model requires a schema change and is more expensive.
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
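To make the two layouts concrete, here is a sketch of each (table and column names are made up for illustration):

```sql
-- "star": one wide row per timestamp, one column per metric
CREATE TABLE metrics_star (
  ts   timestamptz PRIMARY KEY,
  m001 integer NOT NULL,
  m002 double precision NOT NULL
  -- ... one column for each of the 500 metrics
);

-- "EAV": one narrow row per (timestamp, metric)
CREATE TABLE metric_type (
  metric_type_id serial PRIMARY KEY,
  name           text NOT NULL UNIQUE
);

CREATE TABLE metrics_eav (
  ts             timestamptz NOT NULL,
  metric_type_id integer NOT NULL REFERENCES metric_type ON DELETE CASCADE,
  value          double precision NOT NULL,
  PRIMARY KEY (ts, metric_type_id)
);
```

Note the ON DELETE CASCADE: dropping a row from metric_type removes all its measurements, which is the cheap add/delete path described above.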
Some decision guidance
Aggregates (min/max/avg/...) on a single metric-type? -> "star"
Aggregates considering all or many different metrics? -> "EAV"
Do these attributes describe a common entity? -> "star"
Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
Your data is written once and then never changed? -> "star"
Or do you run UPDATES on selected metrics? -> "EAV"
Is your set of 500 metrics complete? The set hardly ever or never changes? -> "star"
New metrics are added, existing metrics are dropped all the time? -> "EAV"
Storage size
Concerning your comment:
Storage is less important now for optimization, we are focusing on query times.
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". An int occupies 4 bytes; a double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) occupy 2036 bytes per row in the "star" schema. No compression possible. Here is how to calculate that:
Configuring PostgreSQL for read performance
If you mix int with double, be sure not to waste space on alignment padding: group the integer metrics together and the double metrics together.
In the "EAV" model, you need at least 44 or 52 bytes per row (int or double value), i.e. 22000 or 26000 bytes for all 500 metrics at a single timestamp: 11 to 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
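The arithmetic behind those figures can be checked with a quick back-of-the-envelope script. It assumes PostgreSQL defaults (8 kB pages, 23-byte heap tuple header padded to 24 bytes, 4-byte item pointer per tuple); all counts are approximations.

```python
TUPLE_HEADER = 24   # 23-byte heap tuple header, padded to a multiple of 8
ITEM_POINTER = 4    # line pointer in the page header

# "star": 1 timestamp (8 bytes) + 500 integer columns (4 bytes each)
star_row = TUPLE_HEADER + 8 + 500 * 4 + ITEM_POINTER
print(star_row)  # 2036 bytes, matching the figure above

# "EAV": one row per metric -> header + timestamp (8) + metric_type FK (4)
# + the value itself: int (4), or double (8) preceded by 4 bytes of
# alignment padding
eav_row_int = TUPLE_HEADER + 8 + 4 + 4 + ITEM_POINTER          # 44 bytes
eav_row_dbl = TUPLE_HEADER + 8 + 4 + 4 + 8 + ITEM_POINTER      # 52 bytes

# all 500 metrics at one timestamp
print(eav_row_int * 500, eav_row_dbl * 500)  # 22000 26000
```

That is where the 11x to 13x blow-up in pages to read comes from.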
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
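For illustration, a sketch of crosstab() against the hypothetical EAV table from above (names are made up; the column list in the AS clause has to match the metrics being pivoted):

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- pivot two hypothetical metrics into one row per timestamp
SELECT *
FROM   crosstab(
         $$SELECT ts, metric_type_id, value
           FROM   metrics_eav
           WHERE  metric_type_id IN (1, 2)
           ORDER  BY 1, 2$$
       ) AS ct (ts       timestamptz,
                metric_1 double precision,
                metric_2 double precision);
```

This turns EAV rows back into a "star"-shaped result set at query time.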
Very short version: Yes, sometimes.
PostgreSQL can use bitmap index scans to combine multiple indexes.
A predicate like
WHERE a > 50 AND a < 50000
is a specialisation of the more general form
WHERE a > 50 AND b < 50000
for a = b.
PostgreSQL can use two indexes here, one for each part of the predicate, and then bitmap AND
them. It doesn't matter if they happen to be on different ranges of the same column.
This is much less efficient than a single index, and may not be useful for some queries, but it's possible.
The bigger problem is that PostgreSQL's partial index support is not very bright. Whether there are one or two indexes, it may simply fail to figure out that it can use an index at all.
Demonstration setup:
CREATE TABLE partial (x integer, y integer);
CREATE INDEX xs_above_50 ON partial(x) WHERE (x > 50);
CREATE INDEX xs_below_50000 ON partial(x) WHERE (x < 50000);
INSERT INTO partial(x,y) SELECT a, a FROM generate_series(1,100000) a;
OK, what will Pg prefer for given queries?
regress=> EXPLAIN SELECT y FROM partial WHERE x > 50 AND x < 50000;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..1788.47 rows=50309 width=4)
Index Cond: ((x > 50) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 20 AND x < 50000;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on partial (cost=0.00..1943.00 rows=50339 width=4)
Filter: ((x > 20) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 50000;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..1787.45 rows=50258 width=4)
Index Cond: ((x > 100) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
QUERY PLAN
---------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..710.71 rows=19921 width=4)
Index Cond: ((x > 100) AND (x < 20000))
(2 rows)
What if we try to force a bitmap index scan just to find out if Pg can use one, even if it's not worth doing for this particular simple case and small sample?
Try:
regress=> SET enable_seqscan = off;
SET
regress=> SET enable_indexscan = off;
SET
regress=> SET enable_indexonlyscan = off;
SET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on partial (cost=424.48..1166.30 rows=19921 width=4)
Recheck Cond: ((x > 100) AND (x < 20000))
-> Bitmap Index Scan on xs_above_50 (cost=0.00..419.50 rows=19921 width=0)
Index Cond: ((x > 100) AND (x < 20000))
(4 rows)
Hm. Nope. It's not combining the indexes there. It may be capable of it but simply not consider scanning a second index worth the cost, though.
What about a query that ORs two predicates instead?
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
QUERY PLAN
---------------------------------------------------------------------------------------
Bitmap Heap Scan on partial (cost=1905.29..3848.29 rows=99908 width=4)
Recheck Cond: ((x > 100) OR (x < 200))
-> BitmapOr (cost=1905.29..1905.29 rows=100000 width=0)
-> Bitmap Index Scan on xs_above_50 (cost=0.00..1849.60 rows=99908 width=0)
Index Cond: (x > 100)
-> Bitmap Index Scan on xs_below_50000 (cost=0.00..5.73 rows=193 width=0)
Index Cond: (x < 200)
(7 rows)
Here PostgreSQL has ORed both indexes to find a match, then done a heap scan and recheck.
So yes, PostgreSQL can combine multiple partial indexes, at least for some queries, where it is useful to do so.
But if I RESET the planner overrides...
regress=> RESET enable_seqscan;
RESET
regress=> RESET enable_indexscan ;
RESET
regress=> RESET enable_indexonlyscan ;
RESET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on partial (cost=0.00..1943.00 rows=99908 width=4)
Filter: ((x > 100) OR (x < 200))
(2 rows)
... Pg will realise it's faster to just sequentially scan the table.
Best Answer
Your index on
(study, analysis, gene, sample) INCLUDE (value)
would still be usable for a query like
select gene, sample, value from t where study='PNAS908' and analysis = 'the correct one'
and will probably be better than BRIN, or at least not worse by enough to matter.
Note that if "analysis" is just labelled in a low-entropy way within a study (a, b, c, 1, 2, 3, log, linear, sqrt, intervention, SOC, placebo, ...), such that the same analysis identifier is reused from study to study, then the multi-column BRIN index is unlikely to do what you want. The columns in a BRIN index are not hierarchical the way they are within a btree index; each column is summarized independently.