I am building a table that contains chemical compounds (many millions of rows) and of those compounds certain predetermined features/fragments are flagged in a fixed length bitstring.
This bitstring will have between 2000 and 20000 bits, further research needs to be done to determine more precise figures there.
When searching for compounds that have certain specific features, or lack specific features, a search is done on a select subset of this bitstring. This can be a different subset every time.
Is there an index type that can make these searches efficient in PostgreSQL (9.6 or 10)?
Insertions are infrequent and done in a batch process, whereas searches are the mostly used operation and should preferably be quick and not have false positives or false negatives.
To me this vaguely sounds like work for a GIN index, but my understanding of this index type is not enough to say for sure if that is realy the case.
Actually there could be another solution, and that is to create a separate 'fragment_index' table, with fragment identifiers (as they have a fixed position in the bitstring they also have a numerical identifier) + compound-id pairs.
My worry there is that that table can become enormous (20M compounds with avg 50 hits on fragments = 1G rows) and multiple joins against it (one for each fragment) where that join can also return up to 80% of matches with the compounds table (in some cases this is likely) will not perform at all.
I would be gratefull to get any suggestions towards a direction to get this on the road.
update: i tried a GIN index using the trigram module on a varchar array with codified shortcodes, it gave mixed results, mainly depending on the amount of data left over after the filter operations.
For the sake of giving examples that have any meaning let's assume the table looks like this:
CREATE TABLE testcompounds (
id serial primary key,
cd_structure text,
features_as_text varchar(128),
features_as_bits bit varying(32)
);
CREATE INDEX flags_testcompounds on testcompounds using gin (features_as_text gin_trgm_ops);
CREATE TABLE fragments (
id serial primary key,
smarts text,
keystring varchar(4),
frequency int
);
insert into fragments (keystring,smarts) values('AAA', '*=O');
insert into fragments (keystring,smarts) values('AAB', '[#6]=O');
insert into fragments (keystring,smarts) values('AAC', '[#7]=O');
...
insert into fragments (keystring,smarts) values('AAN', '[#6]-F');
insert into fragments (keystring,smarts) values('AAO', '[#6]-Cl');
insert into fragments (keystring,smarts) values('AAP', '[#6]-Br');
...
etc.
and the features_as_text and features_as_bits fields have been fully populated.
An example of a query that can be ran on this would be:
select id, cd_structure from testcompounds
where
(features_as_bits & (B'00000000000000000000000000000001' << (2-1)) = (B'00000000000000000000000000000001' << (2-1)))
AND (features_as_bits & (B'00000000000000000000000000000001' << (18-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (19-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (5-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (6-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (7-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (8-1)) = (B'00000000000000000000000000000000'))
AND (features_as_bits & (B'00000000000000000000000000000001' << (9-1)) = (B'00000000000000000000000000000000'))
In words: get everything that has feature 2, and does not have any of features 18,19,5,6,7,8,9
Best Answer
I did test some index strategies for bitwise operation. A little long procedure, sorry.
First of all,
Environment
version: PostgreSQL 10.3
host: A new instance of Amazon RDS db.m4.large (vCPU:2, Memory:8GB, Disk:GP2-SSD 100GB)
table/index definition
Create extensions
Create table
Create one btree index on each boolean column
Create one composite btree index on bool columns
Create one bloom index on bool columns
Create one bloom index on int column
Create one bloom index on bit column
Create one GIN index on int array column
Create one GIN int index on int column
Create one GIN int index on bit column
Insert
Insert
It takes long time.
analyze
index size
result
test sql
All Execution time result is taken from second query result since first query usually takes some time to load index into memory.
full-scan on int column
1 bit filtering(Parallel Seq Scan, Execution time: 122930.731 ms)
16 bit filtering(Parallel Seq Scan, 122896.131 ms)
btree index on bool column
1 bit filtering(Seq Scan, Execution time: 122853.069 ms)
16 bit filtering(Parallel Seq, Execution time: 122834.960 ms)
composite btree index on bool columns
16 bit filtering(idx_btree_on_composite_bool is used, Execution time: 293.317 ms)
bloom index on bool columns
1 bit filtering(Seq Scan, Execution time: 122726.850 ms)
16 bit filtering(idx_bloom_on_bool is used, Execution time: 373.581 ms)
bloom index on int column
1 bit filtering(Seq Scan, Execution time: 122660.620 ms)
16 bit filtering(idx_bloom_on_int is used, Execution time: 391.335 ms)
bloom index on bit column
1 bit filtering(Seq Scan, Execution time: 122434.644 ms)
16 bit filtering(idx_bloom_on_bit is used, Execution time: 397.157 ms)
GIN index on int array
1 bit filtering(idx_gin_on_int_arr is used, Execution time: 440942.779 ms)
16 bit filtering(idx_gin_on_int_arr is used, Execution time: 15322.953 ms)
GIN int index on int
1 bit filtering(idx_gin_int_on_int is used, Execution time: 489909.621 ms)
16 bit filtering(idx_gin_int_on_int is used, Execution time: 15259.772 ms)
GIN int index on bit
1 bit filtering(idx_gin_int_on_bit is used, Execution time: 506071.625 ms)
16 bit filtering(idx_gin_int_on_bit is used, Execution time: 15292.945 ms)
My thoughts