As noted in the comments, there are valid arguments on both sides. Let's call the two designs "star" (the flattened schema of your DBA) and "EAV" (entity-attribute-value). The latter name is a hint in itself; details in this related answer:
Is there a name for this database structure?
Well, if your 500 metrics are of well-known types and you don't invent new ones or drop old ones all the time, it's not the worst-case EAV scenario, but it is similar.
There is no "standard" way to do it. The more flexible approach is the "EAV" schema: you can easily add and delete metric types by adding a row to the referenced metric_type table, or deleting one (cascading to the metrics table). The same operation in the "star" model requires a schema change and is more expensive.
You can do quite a bit for either model with smart covering / partial / multicolumn indexes.
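To make the two layouts concrete, here is a sketch of each (table and column names are made up for illustration):

```sql
-- "star": one wide row per timestamp, one column per metric
CREATE TABLE metrics_star (
  ts   timestamptz PRIMARY KEY,
  m001 integer NOT NULL,
  m002 double precision NOT NULL
  -- ... one column for each of the 500 metrics
);

-- "EAV": one narrow row per (timestamp, metric)
CREATE TABLE metric_type (
  metric_type_id serial PRIMARY KEY,
  name           text NOT NULL UNIQUE
);

CREATE TABLE metrics_eav (
  ts             timestamptz NOT NULL,
  metric_type_id integer NOT NULL REFERENCES metric_type ON DELETE CASCADE,
  value          double precision NOT NULL,
  PRIMARY KEY (ts, metric_type_id)
);
```

Note the ON DELETE CASCADE: dropping a row from metric_type removes all its measurements, which is the cheap add/delete path described above.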
Some decision guidance
Aggregates (min/max/avg/...) on a single metric-type? -> "star"
Aggregates considering all or many different metrics? -> "EAV"
Do these attributes describe a common entity? -> "star"
Or is it just a bunch of numbers that may be grouped / split up any other way? -> "EAV"
Your data is written once and then never changed? -> "star"
Or do you run UPDATES on selected metrics? -> "EAV"
Is your set of 500 metrics complete? The set hardly ever or never changes? -> "star"
New metrics are added, existing metrics are dropped all the time? -> "EAV"
Storage size
Concerning your comment:
Storage is less important now for optimization, we are focusing on query times.
Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.
Let's start with your casual remark: "Data type is int or double". An int occupies 4 bytes; a double occupies 8 bytes.
Assuming all columns are NOT NULL, 500 integer columns plus 1 timestamp plus row overhead (no padding) occupy 2036 bytes per row in the "star" schema. No compression possible. Here is how to calculate that:
Configuring PostgreSQL for read performance
If you mix int with double, be sure not to waste space on alignment padding: group the integer metrics together and the double metrics together.
In the "EAV" model, you need at least 44 or 52 bytes per row (int or double value), i.e. 22000 or 26000 bytes for all 500 metrics at a single timestamp: 11 to 13 times as much. That matters. For one hour's worth of data, you need to fetch 2000 data pages or more (default page size 8k) vs. around 180 pages for the "star" schema.
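The arithmetic behind those figures can be checked with a quick back-of-the-envelope script. It assumes PostgreSQL defaults (8 kB pages, 23-byte heap tuple header padded to 24 bytes, 4-byte item pointer per tuple); all counts are approximations.

```python
TUPLE_HEADER = 24   # 23-byte heap tuple header, padded to a multiple of 8
ITEM_POINTER = 4    # line pointer in the page header

# "star": 1 timestamp (8 bytes) + 500 integer columns (4 bytes each)
star_row = TUPLE_HEADER + 8 + 500 * 4 + ITEM_POINTER
print(star_row)  # 2036 bytes, matching the figure above

# "EAV": one row per metric -> header + timestamp (8) + metric_type FK (4)
# + the value itself: int (4), or double (8) preceded by 4 bytes of
# alignment padding
eav_row_int = TUPLE_HEADER + 8 + 4 + 4 + ITEM_POINTER          # 44 bytes
eav_row_dbl = TUPLE_HEADER + 8 + 4 + 4 + 8 + ITEM_POINTER      # 52 bytes

# all 500 metrics at one timestamp
print(eav_row_int * 500, eav_row_dbl * 500)  # 22000 26000
```

That is where the 11x to 13x blow-up in pages to read comes from.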
Here are some tools to measure size:
Measure the size of a PostgreSQL table row
I think storage size can be the key to performance here. Since you are focusing on query times, if I had to make a wild guess, the "star" schema is probably the better choice.
But as I said, it depends on a lot of details.
Either way, you may be interested in cross tabulation ("pivot table"). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
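For illustration, a sketch of crosstab() against the hypothetical EAV table from above (names are made up; the column list in the AS clause has to match the metrics being pivoted):

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- pivot two hypothetical metrics into one row per timestamp
SELECT *
FROM   crosstab(
         $$SELECT ts, metric_type_id, value
           FROM   metrics_eav
           WHERE  metric_type_id IN (1, 2)
           ORDER  BY 1, 2$$
       ) AS ct (ts       timestamptz,
                metric_1 double precision,
                metric_2 double precision);
```

This turns EAV rows back into a "star"-shaped result set at query time.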
Very short version: Yes, sometimes.
PostgreSQL can use bitmap index scans to combine multiple indexes.
A predicate like
WHERE a > 50 AND a < 50000
is a specialisation of the more general form
WHERE a > 50 AND b < 50000
for a = b.
PostgreSQL can use two indexes here, one for each part of the predicate, and then bitmap AND
them. It doesn't matter if they happen to be on different ranges of the same column.
This is much less efficient than a single index, and may not be useful for some queries, but it's possible.
The bigger problem is that PostgreSQL's partial index support is not very bright. Whether there are one or two indexes, it may simply fail to figure out that it can use an index at all.
Demonstration setup:
CREATE TABLE partial (x integer, y integer);
CREATE INDEX xs_above_50 ON partial(x) WHERE (x > 50);
CREATE INDEX xs_below_50000 ON partial(x) WHERE (x < 50000);
INSERT INTO partial(x,y) SELECT a, a FROM generate_series(1,100000) a;
OK, what will Pg prefer for given queries?
regress=> EXPLAIN SELECT y FROM partial WHERE x > 50 AND x < 50000;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..1788.47 rows=50309 width=4)
Index Cond: ((x > 50) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 20 AND x < 50000;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on partial (cost=0.00..1943.00 rows=50339 width=4)
Filter: ((x > 20) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 50000;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..1787.45 rows=50258 width=4)
Index Cond: ((x > 100) AND (x < 50000))
(2 rows)
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
QUERY PLAN
---------------------------------------------------------------------------------
Index Scan using xs_above_50 on partial (cost=0.29..710.71 rows=19921 width=4)
Index Cond: ((x > 100) AND (x < 20000))
(2 rows)
What if we try to force a bitmap index scan just to find out if Pg can use one, even if it's not worth doing for this particular simple case and small sample?
Try:
regress=> SET enable_seqscan = off;
SET
regress=> SET enable_indexscan = off;
SET
regress=> SET enable_indexonlyscan = off;
SET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on partial (cost=424.48..1166.30 rows=19921 width=4)
Recheck Cond: ((x > 100) AND (x < 20000))
-> Bitmap Index Scan on xs_above_50 (cost=0.00..419.50 rows=19921 width=0)
Index Cond: ((x > 100) AND (x < 20000))
(4 rows)
Hm. Nope. It's not combining the indexes there. It may be capable of it but simply not consider scanning a second index worth the cost, though.
What about a query that ORs two predicates instead?
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
QUERY PLAN
---------------------------------------------------------------------------------------
Bitmap Heap Scan on partial (cost=1905.29..3848.29 rows=99908 width=4)
Recheck Cond: ((x > 100) OR (x < 200))
-> BitmapOr (cost=1905.29..1905.29 rows=100000 width=0)
-> Bitmap Index Scan on xs_above_50 (cost=0.00..1849.60 rows=99908 width=0)
Index Cond: (x > 100)
-> Bitmap Index Scan on xs_below_50000 (cost=0.00..5.73 rows=193 width=0)
Index Cond: (x < 200)
(7 rows)
Here PostgreSQL has ORed both indexes to find a match, then done a heap scan and recheck.
So yes, PostgreSQL can combine multiple partial indexes, at least for some queries, where it is useful to do so.
But if I RESET the planner overrides...
regress=> RESET enable_seqscan;
RESET
regress=> RESET enable_indexscan ;
RESET
regress=> RESET enable_indexonlyscan ;
RESET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on partial (cost=0.00..1943.00 rows=99908 width=4)
Filter: ((x > 100) OR (x < 200))
(2 rows)
... Pg will realise it's faster to just sequentially scan the table.
Best Answer
Your index on
(study, analysis, gene, sample) INCLUDE (value)
would still be usable for a query like
select gene, sample, value from t where study='PNAS908' and analysis = 'the correct one'
and will probably be better than BRIN, or at least not worse by enough to matter.
Note that if "analysis" is just labelled in a low-entropy way within a study (a, b, c, 1, 2, 3, log, linear, sqrt, intervention, SOC, placebo, ...), such that the same analysis identifier is reused from study to study, then the multi-column BRIN index is unlikely to do what you want. The columns in a BRIN index are not hierarchical the way they are within a btree index; each column is summarized independently.