PostgreSQL – Optimizing a query that matches a string and a timestamp range

optimization, postgresql, postgresql-9.4

I am building a database in PostgreSQL for financial data, where the table looks like this:

create table fin_data(
    brokerid text,
    stock int, 
    holding bigint, 
    stake float, 
    value bigint, 
    price float, 
    holding_time tstzrange, 
    unique(brokerid, stock, holding, holding_time));

The unique constraint already seems to create a default index, named fin_data_brokerid_stock_holding_holding_time_key.
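If you want to double-check which indexes exist on the table, a simple catalog query does it (nothing here is specific to this schema beyond the table name):

-- list every index on fin_data together with its definition
select indexname, indexdef
from pg_indexes
where tablename = 'fin_data';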

How the data is organized: Each 'stock' has anywhere from 100-500 'brokers' holding a certain # of shares per day. I have around 95 million rows.

Example of my main query is:

select * 
from fin_data 
where brokerid='C00019' 
AND holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone;

Which basically means that for a specified broker, I want to find out how many shares that broker holds in each stock on a certain day.

Unfortunately, this is not a very fast query; the EXPLAIN ANALYZE output looks something like this:

Bitmap Heap Scan on fin_data
    (cost=115762.08..1362596.53 rows=14079 width=64)
    (actual time=571.648..1729.416 rows=1840 loops=1)
  Recheck Cond: (brokerid = 'C00019'::text)
  Rows Removed by Index Recheck: 116323
  Filter: (holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone)
  Rows Removed by Filter: 2054316
  ->  Bitmap Index Scan on fin_data_brokerid_stock_holding_holding_time_key
        (cost=0.00..115758.56 rows=1982116 width=0)
        (actual time=569.477..569.477 rows=2056156 loops=1)
        Index Cond: (brokerid = 'C00019'::text)
Total runtime: 1729.933 ms

Runtime is 1.73 seconds.

This can easily climb to 2-3 seconds for certain brokers. That is somewhat acceptable, but I would like it to be under 1 second if possible.

The strange thing is that when I later created an index on (stock, holding_time) only, similar queries filtered by stock would often run in the low hundreds of milliseconds, typically 100-300 ms.
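For reference, that index was presumably created with something along these lines (the name stock_date_index is taken from the plan below; the exact statement is my assumption):

-- plain B-Tree index; the @> condition on the range column can still only be applied as a filter
create index stock_date_index on fin_data (stock, holding_time);

With that index in place, a stock-filtered query produced a plan like this: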

Index Scan using stock_date_index on fin_data
    (cost=0.57..790795.62 rows=3136 width=64)
    (actual time=3.825..411.985 rows=644 loops=1)
  Index Cond: (stock = 5)
  Filter: (holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone)
  Rows Removed by Filter: 447426
Total runtime: 412.123 ms

Which is much more desirable.

A hard fact about the data: each stock has at most around 500 brokers, but some brokers hold positions in 1000+ stocks. Is that the only reason my broker-filtered queries are so much slower? How do I optimize this?

Best Answer

Default B-Tree indexes can't operate very intelligently on range data types. What you want is a GiST index on the range, so something like:

create index on fin_data using gist (holding_time);

For your main query, this would probably give a BitmapAnd plan: one index supplies the list of row pointers satisfying the brokerid condition, the other supplies the list satisfying the holding_time condition, and the two lists are intersected before the table is visited. This could be quite a bit faster. It is also the most flexible option, because that GiST index can be combined with any other indexes you have. If that is not fast enough, you could run create extension btree_gist and then build a combined index:

create index on fin_data using gist (brokerid, holding_time);

This would be less flexible for other query variants, but ideal for the query you identified as your main query.
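Spelled out, the btree_gist route might look like the following end to end; the index name fin_data_brokerid_holding_time_gist is just an illustrative choice, and the EXPLAIN is there so you can confirm the planner actually picks the new index:

-- one-time setup per database; btree_gist ships with the standard contrib modules
create extension if not exists btree_gist;

-- composite GiST index covering both predicates of the main query
create index fin_data_brokerid_holding_time_gist
    on fin_data using gist (brokerid, holding_time);

-- check the new plan
explain (analyze, buffers)
select *
from fin_data
where brokerid = 'C00019'
  and holding_time @> '2015-09-28 00:00:00+08'::timestamptz;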

Note that GiST indexes take a lot longer to build initially than B-Tree indexes. They are also slower to maintain when new rows are inserted, so if your system struggles to keep up with the insertion rate, benchmark that first.
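One rough way to benchmark the maintenance overhead, assuming psql, is to time a synthetic batch insert before and after building the GiST index and compare. The batch below is rolled back, but the index maintenance work (and therefore the timing) still happens:

\timing on

begin;

-- 100k synthetic rows; brokerid values are made unique so the unique constraint is not violated
insert into fin_data (brokerid, stock, holding, stake, value, price, holding_time)
select 'TEST' || g, 999999, g, 0.0, 0, 0.0,
       tstzrange(now(), now() + interval '1 day')
from generate_series(1, 100000) g;

rollback;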


Another idea, which doesn't involve creating more indexes, is to select just the columns you need rather than * in your queries. That may allow index-only scans to be used where they otherwise cannot be.
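For example, if the per-stock holdings are all you need, something like the query below can in principle be answered from the existing unique index alone, since every referenced column is part of that index; whether an index-only scan is actually chosen still depends on the visibility map and planner costing:

select stock, holding, holding_time
from fin_data
where brokerid = 'C00019'
  and holding_time @> '2015-09-28 00:00:00+08'::timestamptz;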