PostgreSQL 11 – Storing billions of relationships in normalized table partitions or an intarray[] column

Tags: graph, optimization, postgresql, query-performance

Some background information

PostgreSQL version: 11.2
GCE
CPU: 6 cores
SSD: 250 GB
RAM: 64 GB

We are currently deciding how to store massive numbers (in the billions) of links/relationships between items based on how similar they are. We run an analytical query that determines that item A is similar to other items of its kind. We then want to store these links for faster lookup in the future, so we can say item A is linked to items B, C, E, and G. All of this is for analytical decision making on these items.

The item table has about 43 million rows and grows every month. That's a potential 43 million x 43 million possible relationships, so several billion links that need to be stored.

Using partitions

My first idea was to partition by hash with modulus 10 and store each link individually, since we will be looking up an item's related rows by its ID:

create table item_rels(item_id int, rel_item_id int) partition by hash(item_id);
create table item_rels_1 partition of item_rels FOR VALUES WITH (modulus 10, remainder 0);
...
create table item_rels_10 partition of item_rels FOR VALUES WITH (modulus 10, remainder 9);

create index item_rels_item_id on item_rels(item_id, rel_item_id);

This allows the query optimizer to prune partitions it doesn't need to look at. An example query to get the related IDs for item 1:

select item_id, array_agg(rel_item_id) from item_rels where item_id = 1 group by item_id;

Using an int[] column with the intarray extension

Another idea is storing all of the IDs in a single row, using the intarray extension and a GIN index to allow special operations like @> (contains) or && (overlaps). This goes against normalization and referential integrity.

create extension if not exists intarray;

create table item_rels_arr(
   item_id int,
   rels int[]
);

create index item_rels_arr_idx on item_rels_arr using gin(rels gin__int_ops);

Now we can simply query for the IDs that item 1 is linked to:

select item_id, rels from item_rels_arr where item_id = 1;

And we can do fancy things, like finding which other rows contain item 1 within their arrays:

select item_id, rels from item_rels_arr where rels && '{1}';

Example data output

 item_id |                                               rels                                                
---------+---------------------------------------------------------------------------------------------------
    6159 | {1,533,1057,1570,2082,7071,7880805,8501204,10508802,16296233,16297449,16298534,16300221,16301098}
     533 | {1,1057,1570,2082,6159,7071,7880805,8501204,8621287,16298534,16300221,16301098}
    2082 | {1,533,1057,1570,6159,7071,7880805,8501204,8621287,16298534,16300221,16301098}
    1570 | {1,533,1057,2082,6159,7071,7880805,8501204,8621287,16298534,16300221,16301098,25460389}
    1057 | {1,533,1570,2082,6159,7071,7880805,8501204,8621287,16298534,16300221,16301098}
    7071 | {1,533,1057,1570,2082,6159,7880805,8501204,16298534,16300221,16301098}

As you can see, we found 6 items that are linked to item ID 1. And for "free", we get their relationships too, as maybe these IDs could also be similar.

The query plan and performance for this query:

Append  (cost=28.37..1699.40 rows=483 width=331) (actual time=0.037..0.206 rows=6 loops=1)
   Buffers: shared hit=58
   ->  Bitmap Heap Scan on public.item_rels_arr_1  (cost=28.37..167.73 rows=48 width=326) (actual time=0.036..0.036 rows=1 loops=1)
         Output: item_rels_arr_1.item_id, item_rels_arr_1.rels
         Recheck Cond: (item_rels_arr_1.rels && '{1}'::integer[])
         Heap Blocks: exact=1
         Buffers: shared hit=6
         ->  Bitmap Index Scan on item_rels_arr_1_rels_idx  (cost=0.00..28.36 rows=48 width=0) (actual time=0.031..0.031 rows=1 loops=1)
               Index Cond: (item_rels_arr_1.rels && '{1}'::integer[])
               Buffers: shared hit=5
   ->  Bitmap Heap Scan on public.item_rels_arr_2  (cost=32.38..175.27 rows=49 width=337) (actual time=0.026..0.026 rows=1 loops=1)
         Output: item_rels_arr_2.item_id, item_rels_arr_2.rels
         Recheck Cond: (item_rels_arr_2.rels && '{1}'::integer[])
         Heap Blocks: exact=1
         Buffers: shared hit=7
         ->  Bitmap Index Scan on item_rels_arr_2_rels_idx  (cost=0.00..32.37 rows=49 width=0) (actual time=0.023..0.023 rows=1 loops=1)
               Index Cond: (item_rels_arr_2.rels && '{1}'::integer[])
               Buffers: shared hit=6

               <.. OMITTED OTHER PARTITIONS ..>
 Planning Time: 0.433 ms
 Execution Time: 0.329 ms

Doing this same query with the normalized, partitioned table looks something like this:

 WITH rels AS (
    select *
    from item_rels p1
    where p1.item_id = 1
)
select item_id, array_agg(rel_item_id) as rels from item_rels
where item_id IN (select rel_item_id from rels)
group by item_id;

And the corresponding plan:

Sort  (cost=542105.79..542153.03 rows=18897 width=36) (actual time=0.392..0.393 rows=6 loops=1)
   Output: item_rels_1.item_id, (array_agg(item_rels_1.rel_item_id))
   Sort Key: item_rels_1.item_id
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=49
   CTE rels
     ->  Append  (cost=0.43..78.61 rows=2230 width=8) (actual time=0.023..0.026 rows=13 loops=1)
           Buffers: shared hit=4
           ->  Index Only Scan using item_rels_1_item_id_rel_item_id_idx on public.item_rels_1 p1  (cost=0.43..67.46 rows=2230 width=8) (actual time=0.022..0.024 rows=13 loops=1)
                 Output: p1.item_id, p1.rel_item_id
                 Index Cond: (p1.item_id = 1)
                 Heap Fetches: 0
                 Buffers: shared hit=4
   ->  HashAggregate  (cost=540448.72..540684.94 rows=18897 width=36) (actual time=0.292..0.382 rows=6 loops=1)
         Output: item_rels_1.item_id, array_agg(item_rels_1.rel_item_id)
         Group Key: item_rels_1.item_id
         Buffers: shared hit=49
         ->  Nested Loop  (cost=50.61..206448.41 rows=66800062 width=8) (actual time=0.080..0.256 rows=74 loops=1)
               Output: item_rels_1.item_id, item_rels_1.rel_item_id
               Buffers: shared hit=49
               ->  HashAggregate  (cost=50.18..52.18 rows=200 width=4) (actual time=0.039..0.042 rows=13 loops=1)
                     Output: rels.rel_item_id
                     Group Key: rels.rel_item_id
                     Buffers: shared hit=4
                     ->  CTE Scan on rels  (cost=0.00..44.60 rows=2230 width=4) (actual time=0.026..0.033 rows=13 loops=1)
                           Output: rels.item_id, rels.rel_item_id
                           Buffers: shared hit=4
               ->  Append  (cost=0.43..786.81 rows=24517 width=8) (actual time=0.014..0.015 rows=6 loops=13)
                     Buffers: shared hit=45
                     ->  Index Only Scan using item_rels_1_item_id_rel_item_id_idx on public.item_rels_1  (cost=0.43..66.23 rows=2443 width=8) (actual time=0.012..0.013 rows=14 loops=1)
                           Output: item_rels_1.item_id, item_rels_1.rel_item_id
                           Index Cond: (item_rels_1.item_id = rels.rel_item_id)
                           Heap Fetches: 0
                           Buffers: shared hit=4

                           <..OMITTED PARTITIONS..>
 Planning Time: 0.703 ms
 Execution Time: 0.760 ms

So we get the same results using this kind of query on our normalized, partitioned table, just in a slightly higher millisecond range.

Downsides

A downside of the intarray[] approach is that when we insert new items, we want to record their links with existing items in our database, but the insert only creates the new item's own relationships; it does not update the related items' arrays with the new ID. So on insert we would have a storm of updates to append this ID onto the opposite side of each relationship determined during calculation, as sketched below.
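
A rough sketch of that reverse update, assuming the new item's links have already been calculated; :new_id and :new_rels are placeholder parameters, not part of the real pipeline:

-- Insert the new item's own relationship row.
insert into item_rels_arr (item_id, rels)
values (:new_id, :new_rels);

-- Then patch every related item's array with the new ID. It is a single
-- statement, but it still rewrites every affected row.
update item_rels_arr
set    rels = rels || :new_id
where  item_id = any (:new_rels)
and    not rels @> array[:new_id];   -- skip rows that already contain it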

Sample data size

My first sample data set, using 100,000 items (out of the current 43M), generated 133 million rows in my normalized partition test.

When querying for batches of IDs and their rels, the query mentioned above can get up into the 200 ms range, while the int array version stayed relatively flat, near 43 ms. The batched forms look roughly like the sketch below.
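
For reference, a sketch of the batched forms being compared; the literal ID list is just a placeholder:

-- Batched lookup against the normalized, hash-partitioned table.
with rels as (
    select *
    from   item_rels p1
    where  p1.item_id in (1, 533, 1057)   -- placeholder batch of IDs
)
select item_id, array_agg(rel_item_id) as rels
from   item_rels
where  item_id in (select rel_item_id from rels)
group  by item_id;

-- Equivalent batched lookup against the int[] table, using the overlap operator.
select item_id, rels
from   item_rels_arr
where  rels && '{1,533,1057}';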

GraphDB?

My coworker is trying to store this in Neo4j: we calculate the similarity in Postgres and store the generated rows as edges in Neo4j. However, introducing another database into our stack brings its own overhead and complications. And while I haven't fully explored it, I'm not sure how memory-hungry a graph database would be for this, especially if each node has a lot of edges.

Question

Has anyone had experience storing a potential billion-plus links/relationships, and are there consequences to either approach?

My first hunch is to stick with normalized data and hash-partition by item_id to prune most of the initial lookup, given the sheer number of rows. My only worry is that this won't scale well on a single machine, when we want to target no more than a couple of seconds at worst.

Is a graph DB like Neo4j a better direction for storing the final calculated links? I'm skeptical, because I'm not convinced this is 100% a graph problem: we are not doing anything like traversing for the shortest path between items (although maybe; that's for another topic).

Any advice, or flaws you can spot in this, would be great. My only other option is to actually generate a full, real-life-sized dataset of billions of rows to see how it would really perform, which I may just do (roughly as sketched below), but any suggestions or advice on storing this many rows would be appreciated.
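
If I do generate that test set, a minimal sketch might look like the following; the 43M item count and ~100 links per item are assumptions for sizing, and duplicates/self-links are ignored since this is only a volume test:

-- Fill the normalized table with synthetic links (run in chunks in practice,
-- e.g. one slice of item IDs at a time, to keep transactions manageable).
insert into item_rels (item_id, rel_item_id)
select i, floor(random() * 43000000)::int + 1
from   generate_series(1, 43000000) as i,
       generate_series(1, 100);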

Edit

I decided to describe the use case for this data a bit more specifically. We work on bacteria and fungi, run experiments through mass spec, and store that information in the database for analytical discovery.

We then calculate similarity between points of data across all experiments to help drive future discovery, which means all data is required and, in a perfect world, would never be deleted.

This also means the data should not be archived. The only time data would ever be deleted is if the instrument were grossly miscalibrated, which should never happen. If it does, the data that needs to be removed is only a couple of thousand "data points" in the "items" table.

Best Answer

I'd expect that denormalization will lead to pain further down the road.

Do you really plan to have millions of relationships for each item? That would mean an array with millions of entries, which would not work well at all. The array has to be loaded into memory whenever you use it, and modifying it means rewriting the whole array.
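
As a sanity check, you could run something like this against the item_rels_arr table from the question to see how large the arrays actually get:

-- Link count and on-disk size of the biggest arrays.
select item_id,
       cardinality(rels)    as link_count,
       pg_column_size(rels) as array_bytes
from   item_rels_arr
order  by cardinality(rels) desc
limit  10;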

Don't make any predictions based on unrealistically small samples. You need to generate millions of items for a useful test.

One word about partitioning: don't partition by hash. Partition so that you can delete old data easily. That's the main reason for partitioning; queries typically will become slower.
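
For illustration only (the question's edit says data is rarely deleted, and the created_on column here is hypothetical), range partitioning by date makes removing old data a matter of dropping a partition:

-- Alternative layout: range-partition the link table on a date column.
create table item_rels(
    item_id     int,
    rel_item_id int,
    created_on  date not null
) partition by range (created_on);

create table item_rels_2019_05 partition of item_rels
    for values from ('2019-05-01') to ('2019-06-01');

-- Removing a month of data is then a cheap metadata operation:
drop table item_rels_2019_05;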