Find friends of friends (recursively) efficiently using PostgreSQL

Tags: graph, hierarchy, postgresql

Objective: Users submit their Contact Books, and then the application looks for connections between users according to their Phone Numbers. Something like the "6 Handshakes" idea (https://en.wikipedia.org/wiki/Six_degrees_of_separation).

Problem: Make this query perform close to real time: when a user submits their phone number, they should get the full list of other phone numbers they may know. A plain list – without connections (graph vertices, etc.) – but full, not paginated (this requirement is here because the original goal is more complex).

Question: Is it possible to achieve close-to-realtime performance with a pure relational database, without graph databases (Neo4j, etc.), graph extensions (Bitnine AgensGraph), or a workflow redesign? Any denormalization is possible, but to my understanding it won't help.

Given:

test=# select * from connections;
 user_phone | friend_phone
------------+--------------
          1 |            2
          1 |            3
          1 |            4
          2 |            6
          2 |            7
          2 |            8
          8 |           10
          8 |           11
          8 |           12
         20 |           30
         40 |           50
         60 |           70

I expect to receive the following connections for the user with phone = 1:

 friend_phone
--------------
            2
            3
            4
            6
            7
            8
           10
           11
           12
(9 rows)

It is really difficult to estimate real-world connection counts, but I tested with at least:

  • 10,000 Users (Phone Numbers)
  • Each user was randomly assigned with 50-1000 Connections with pseudo-random other Users
  • This resulted in about 1,000,000 Connections

If it is impossible to achieve this in general (using some tricky ORDER BYs, subqueries, etc.), what metrics should be considered, for example, to understand that:

with 1,000,000 connections you need 128GB RAM instance to get 2 seconds response time

and

for 100,000,000 connections you need 1TB RAM instance to get 5 seconds response time

?
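
One way to gather numbers for that kind of estimate (a sketch, using the recursive query shown further below) is to run it with buffer statistics:

EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE p AS (
    SELECT friend_phone FROM connections WHERE user_phone = 1
    UNION
    SELECT friends.friend_phone
    FROM connections AS friends
    JOIN p ON friends.user_phone = p.friend_phone
)
SELECT COUNT(friend_phone) FROM p;

The "shared hit" vs. "shared read" counters in the output give a first indication of whether the traversal's working set fits in shared_buffers/OS cache, which is the kind of input a RAM-vs-response-time estimate would need.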

P.S. I tried subqueries, CTEs, and JOINs, but eventually found that WITH RECURSIVE is the most explicit way, and it has about the same execution time as the other approaches.

This is the table:

CREATE TABLE connections (
  user_phone bigint NOT NULL,
  friend_phone bigint NOT NULL
);

This is how I seed the data:

INSERT INTO connections(user_phone, friend_phone) (
  -- for each user_phone 1..500, friend_phone takes the consecutive values 1..N,
  -- where N is random (up to 5000)
  SELECT generate_series AS user_phone,
         generate_series(1, (random()*5000)::int) AS friend_phone
  FROM generate_series(1, 500)
  ORDER BY user_phone
);
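
Note that this seed gives each user the consecutive friend_phone values 1..N rather than the 50-1000 pseudo-random connections per user described above. A sketch that matches that description more closely (the exact generation method and the DISTINCT/self-loop handling here are just one possibility) could look like this:

INSERT INTO connections (user_phone, friend_phone)
SELECT DISTINCT user_phone, friend_phone                        -- DISTINCT avoids duplicate pairs
FROM (
  SELECT user_phone,
         (1 + floor(random() * 10000))::bigint AS friend_phone  -- pseudo-random friend per generated row
  FROM (
    SELECT u AS user_phone,
           generate_series(1, 50 + (random() * 950)::int)       -- 50-1000 rows per user
    FROM generate_series(1, 10000) AS u                         -- 10,000 users
  ) AS fanout
) AS pairs
WHERE user_phone <> friend_phone;                               -- no self-connections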

I've created some indexes:

test=# \d connections
               Table "public.connections"
    Column    |  Type  | Collation | Nullable | Default
--------------+--------+-----------+----------+---------
 user_phone   | bigint |           | not null |
 friend_phone | bigint |           | not null |
Indexes:
    "connections_user_phone_friend_phone_idx" UNIQUE, btree (user_phone, friend_phone)
    "connections_friend_phone_idx" btree (friend_phone)
    "connections_friend_phone_user_phone_idx" btree (friend_phone, user_phone)
    "connections_user_phone_idx" btree (user_phone)

I expect the number of distinct friend_phone values to be much greater than the number of distinct user_phone values, so the index on connections(friend_phone, user_phone) seemed the most appropriate, but I created all 4 indexes during testing.
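
Note: the recursive step below joins on friends.user_phone = p.friend_phone, so only indexes with user_phone as the leading column can serve that lookup (and the EXPLAIN output further down indeed uses connections_user_phone_idx). If nothing else queries by friend_phone, the two single-column indexes are redundant prefixes of the two-column ones and could in principle be dropped – a sketch, assuming no other workload depends on them:

DROP INDEX IF EXISTS connections_user_phone_idx;    -- prefix of connections_user_phone_friend_phone_idx
DROP INDEX IF EXISTS connections_friend_phone_idx;  -- prefix of connections_friend_phone_user_phone_idx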

2,270,852 connection records were generated. Then I ran this query:

WITH RECURSIVE p AS (
    -- seed: direct connections of the starting phone
    SELECT friend_phone FROM connections WHERE connections.user_phone = 1

    UNION  -- UNION (not UNION ALL) removes duplicates, which also keeps the recursion from looping on cycles

    -- recursive step: connections of everything found so far
    SELECT friends.friend_phone
    FROM connections AS friends
    JOIN p ON friends.user_phone = p.friend_phone
)
SELECT COUNT(friend_phone) FROM p;

It returns 5,002 records and takes about 3 seconds to execute. EXPLAIN ANALYZE looks like the following:

                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3111105.00..3111105.01 rows=1 width=8) (actual time=3151.781..3151.781 rows=1 loops=1)
   CTE p
     ->  Recursive Union  (cost=0.43..2146207.20 rows=42884347 width=8) (actual time=0.060..3148.294 rows=5002 loops=1)
           ->  Index Scan using connections_user_phone_idx on connections  (cost=0.43..3003.69 rows=4617 width=8) (actual time=0.055..2.024 rows=4137 loops=1)
                 Index Cond: (user_phone = 1)
           ->  Merge Join  (cost=4500.77..128551.66 rows=4287973 width=8) (actual time=768.577..1359.598 rows=635428 loops=2)
                 Merge Cond: (friends.user_phone = p_1.friend_phone)
                 ->  Index Scan using connections_user_phone_idx on connections friends  (cost=0.43..54054.59 rows=2270852 width=16) (actual time=0.013..793.467 rows=1722262 loops=2)
                 ->  Sort  (cost=4500.34..4615.77 rows=46170 width=8) (actual time=0.765..74.850 rows=637677 loops=2)
                       Sort Key: p_1.friend_phone
                       Sort Method: quicksort  Memory: 65kB
                       ->  WorkTable Scan on p p_1  (cost=0.00..923.40 rows=46170 width=8) (actual time=0.001..0.314 rows=2501 loops=2)
   ->  CTE Scan on p  (cost=0.00..857686.94 rows=42884347 width=8) (actual time=0.062..3150.755 rows=5002 loops=1)
 Planning Time: 0.409 ms
 Execution Time: 3152.412 ms
(15 rows)

I feel like I'm missing something, because even if many loops are required, each user reaches only a finite number of connections, which is far smaller than the total number of connections (in the example above – ~5,000 user connections against ~2.2M connections overall, about 0.25%). Maybe some specific index type? Maybe some other tricks I don't know about?
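
(For illustration, one trick sometimes used for this kind of reachability problem is to expand the graph one level at a time from a temp-table "frontier", so each step only joins the newly discovered phones against connections instead of everything found so far. The function below is only a sketch against the schema above – the name reachable_phones and the temp-table bookkeeping are made up here, and it is not benchmarked against the recursive CTE.)

CREATE OR REPLACE FUNCTION reachable_phones(start_phone bigint)
RETURNS SETOF bigint
LANGUAGE plpgsql AS
$$
BEGIN
    -- seen: every phone discovered so far; frontier: phones discovered in the previous step
    CREATE TEMP TABLE IF NOT EXISTS seen     (phone bigint PRIMARY KEY);
    CREATE TEMP TABLE IF NOT EXISTS frontier (phone bigint PRIMARY KEY);
    TRUNCATE seen, frontier;

    INSERT INTO frontier
        SELECT friend_phone FROM connections WHERE user_phone = start_phone;
    INSERT INTO seen SELECT phone FROM frontier;

    WHILE EXISTS (SELECT 1 FROM frontier) LOOP
        -- next level: phones reachable from the frontier that have not been seen yet
        CREATE TEMP TABLE next_frontier AS
            SELECT DISTINCT c.friend_phone AS phone
            FROM connections AS c
            JOIN frontier AS f ON c.user_phone = f.phone
            WHERE NOT EXISTS (SELECT 1 FROM seen AS s WHERE s.phone = c.friend_phone);

        INSERT INTO seen SELECT phone FROM next_frontier;

        TRUNCATE frontier;
        INSERT INTO frontier SELECT phone FROM next_frontier;
        DROP TABLE next_frontier;
    END LOOP;

    RETURN QUERY SELECT phone FROM seen WHERE phone <> start_phone;
END
$$;

-- usage: SELECT COUNT(*) FROM reachable_phones(1);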

Thanks in advance for reading such a big question 🙂

P.P.S. I used Postgres 12 with the following postgresql.conf:

# DB Version: 12
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 16 GB
# CPUs num: 8
# Data Storage: ssd

max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 5242kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4

Best Answer

Just by looking at the explain plan, we see that the optimizer's estimates are way off:

  • the sort node is estimated at 46170 rows whereas there are actually 637677 rows
  • the index scan is estimated at 2270852 rows whereas there are actually 1722262 rows

And so on.

So I think you'll need to VACUUM ANALYZE your connections table (friends in the plan is just an alias for it) before getting a new plan.

Maybe it will only help a little, but we need to get rid of this problem before trying to understand what's happening and how to optimize that query in your context.
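
For example (a sketch; whether autovacuum has already taken care of this depends on your setup):

-- refresh planner statistics (and clean up dead tuples) for the table
VACUUM ANALYZE connections;

-- check when statistics were last refreshed
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'connections';

After that, re-running EXPLAIN ANALYZE should show whether the row estimates for the recursive part get any closer to the actual counts.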