Find friends of friends (recursively) efficiently using PostgreSQL

Tags: graph, hierarchy, postgresql

Objective: Users submit their Contact Books, and then the application looks for connections between users according to their Phone Numbers. Something like the "6 Handshakes" idea (https://en.wikipedia.org/wiki/Six_degrees_of_separation).

Problem: Make this query perform close to real time: when a user submits their phone number, they should get the full list of other phone numbers they may know. A plain list – without connections (graph vertices, etc.) – but full, not paginated (this requirement is here because the original goal is more complex).

Question: Is it possible to achieve close-to-realtime performance with a pure relational database, without graph databases (Neo4j, etc.), graph extensions (Bitnine AgensGraph), or a workflow redesign? Any denormalization is possible, but to my understanding it won't help.

Given:

test=# select * from connections;
 user_phone | friend_phone
------------+--------------
          1 |            2
          1 |            3
          1 |            4
          2 |            6
          2 |            7
          2 |            8
          8 |           10
          8 |           11
          8 |           12
         20 |           30
         40 |           50
         60 |           70

I expect to receive the following connections for the user with phone = 1:

 friend_phone
--------------
            2
            3
            4
            6
            7
            8
           10
           11
           12
(9 rows)

It is really difficult to estimate real-world connection counts, but I tested with at least:

  • 10,000 Users (Phone Numbers)
  • Each user was randomly assigned with 50-1000 Connections with pseudo-random other Users
  • This resulted in about 1,000,000 Connections

If it is impossible to achieve this in general (using some tricky ORDER BYs, subqueries, etc.), what metrics should be considered, for example, to understand that:

with 1,000,000 connections you need 128GB RAM instance to get 2 seconds response time

and

for 100,000,000 connections you need 1TB RAM instance to get 5 seconds response time

?
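
One way to gather numbers for that kind of estimate (a sketch, using the recursive query shown further below) is to run it with buffer statistics:

EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE p AS (
    SELECT friend_phone FROM connections WHERE user_phone = 1
    UNION
    SELECT friends.friend_phone
    FROM connections AS friends
    JOIN p ON friends.user_phone = p.friend_phone
)
SELECT COUNT(friend_phone) FROM p;

The "shared hit" vs. "shared read" counters in the output give a first indication of whether the traversal's working set fits in shared_buffers/OS cache, which is the kind of input a RAM-vs-response-time estimate would need.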

P.S. I tried subqueries, CTEs, and JOINs, but eventually found that WITH RECURSIVE is the most explicit way, and it has about the same execution time as the other approaches.

This is the table:

CREATE TABLE connections (
  user_phone bigint NOT NULL,
  friend_phone bigint NOT NULL
);

This is how I seed the data:

INSERT INTO connections(user_phone, friend_phone) (
  -- for each user_phone 1..500, friend_phone takes the consecutive values 1..N,
  -- where N is random (up to 5000)
  SELECT generate_series AS user_phone,
         generate_series(1, (random()*5000)::int) AS friend_phone
  FROM generate_series(1, 500)
  ORDER BY user_phone
);
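
Note that this seed gives each user the consecutive friend_phone values 1..N rather than the 50-1000 pseudo-random connections per user described above. A sketch that matches that description more closely (the exact generation method and the DISTINCT/self-loop handling here are just one possibility) could look like this:

INSERT INTO connections (user_phone, friend_phone)
SELECT DISTINCT user_phone, friend_phone                        -- DISTINCT avoids duplicate pairs
FROM (
  SELECT user_phone,
         (1 + floor(random() * 10000))::bigint AS friend_phone  -- pseudo-random friend per generated row
  FROM (
    SELECT u AS user_phone,
           generate_series(1, 50 + (random() * 950)::int)       -- 50-1000 rows per user
    FROM generate_series(1, 10000) AS u                         -- 10,000 users
  ) AS fanout
) AS pairs
WHERE user_phone <> friend_phone;                               -- no self-connections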

I've created some indexes:

test=# \d connections
               Table "public.connections"
    Column    |  Type  | Collation | Nullable | Default
--------------+--------+-----------+----------+---------
 user_phone   | bigint |           | not null |
 friend_phone | bigint |           | not null |
Indexes:
    "connections_user_phone_friend_phone_idx" UNIQUE, btree (user_phone, friend_phone)
    "connections_friend_phone_idx" btree (friend_phone)
    "connections_friend_phone_user_phone_idx" btree (friend_phone, user_phone)
    "connections_user_phone_idx" btree (user_phone)

I expect the number of distinct friend_phone values to be much greater than the number of distinct user_phone values, so the index on connections(friend_phone, user_phone) seemed the most appropriate, but I created all 4 indexes during testing.
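
Note: the recursive step below joins on friends.user_phone = p.friend_phone, so only indexes with user_phone as the leading column can serve that lookup (and the EXPLAIN output further down indeed uses connections_user_phone_idx). If nothing else queries by friend_phone, the two single-column indexes are redundant prefixes of the two-column ones and could in principle be dropped – a sketch, assuming no other workload depends on them:

DROP INDEX IF EXISTS connections_user_phone_idx;    -- prefix of connections_user_phone_friend_phone_idx
DROP INDEX IF EXISTS connections_friend_phone_idx;  -- prefix of connections_friend_phone_user_phone_idx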

2,270,852 connection records were generated. Then I ran this query:

WITH RECURSIVE p AS (
    -- seed: direct connections of the starting phone
    SELECT friend_phone FROM connections WHERE connections.user_phone = 1

    UNION  -- UNION (not UNION ALL) removes duplicates, which also keeps the recursion from looping on cycles

    -- recursive step: connections of everything found so far
    SELECT friends.friend_phone
    FROM connections AS friends
    JOIN p ON friends.user_phone = p.friend_phone
)
SELECT COUNT(friend_phone) FROM p;

It returns 5,002 records and takes about 3 seconds to execute. EXPLAIN ANALYZE looks like the following:

                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3111105.00..3111105.01 rows=1 width=8) (actual time=3151.781..3151.781 rows=1 loops=1)
   CTE p
     ->  Recursive Union  (cost=0.43..2146207.20 rows=42884347 width=8) (actual time=0.060..3148.294 rows=5002 loops=1)
           ->  Index Scan using connections_user_phone_idx on connections  (cost=0.43..3003.69 rows=4617 width=8) (actual time=0.055..2.024 rows=4137 loops=1)
                 Index Cond: (user_phone = 1)
           ->  Merge Join  (cost=4500.77..128551.66 rows=4287973 width=8) (actual time=768.577..1359.598 rows=635428 loops=2)
                 Merge Cond: (friends.user_phone = p_1.friend_phone)
                 ->  Index Scan using connections_user_phone_idx on connections friends  (cost=0.43..54054.59 rows=2270852 width=16) (actual time=0.013..793.467 rows=1722262 loops=2)
                 ->  Sort  (cost=4500.34..4615.77 rows=46170 width=8) (actual time=0.765..74.850 rows=637677 loops=2)
                       Sort Key: p_1.friend_phone
                       Sort Method: quicksort  Memory: 65kB
                       ->  WorkTable Scan on p p_1  (cost=0.00..923.40 rows=46170 width=8) (actual time=0.001..0.314 rows=2501 loops=2)
   ->  CTE Scan on p  (cost=0.00..857686.94 rows=42884347 width=8) (actual time=0.062..3150.755 rows=5002 loops=1)
 Planning Time: 0.409 ms
 Execution Time: 3152.412 ms
(15 rows)

I feel like I'm missing something, because even if many loops are required, each user reaches only a finite number of connections, which is far smaller than the total number of connections (in the example above – ~5,000 user connections against ~2.2M connections overall, about 0.25%). Maybe some specific index type? Maybe some other tricks I don't know about?
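
(For illustration, one trick sometimes used for this kind of reachability problem is to expand the graph one level at a time from a temp-table "frontier", so each step only joins the newly discovered phones against connections instead of everything found so far. The function below is only a sketch against the schema above – the name reachable_phones and the temp-table bookkeeping are made up here, and it is not benchmarked against the recursive CTE.)

CREATE OR REPLACE FUNCTION reachable_phones(start_phone bigint)
RETURNS SETOF bigint
LANGUAGE plpgsql AS
$$
BEGIN
    -- seen: every phone discovered so far; frontier: phones discovered in the previous step
    CREATE TEMP TABLE IF NOT EXISTS seen     (phone bigint PRIMARY KEY);
    CREATE TEMP TABLE IF NOT EXISTS frontier (phone bigint PRIMARY KEY);
    TRUNCATE seen, frontier;

    INSERT INTO frontier
        SELECT friend_phone FROM connections WHERE user_phone = start_phone;
    INSERT INTO seen SELECT phone FROM frontier;

    WHILE EXISTS (SELECT 1 FROM frontier) LOOP
        -- next level: phones reachable from the frontier that have not been seen yet
        CREATE TEMP TABLE next_frontier AS
            SELECT DISTINCT c.friend_phone AS phone
            FROM connections AS c
            JOIN frontier AS f ON c.user_phone = f.phone
            WHERE NOT EXISTS (SELECT 1 FROM seen AS s WHERE s.phone = c.friend_phone);

        INSERT INTO seen SELECT phone FROM next_frontier;

        TRUNCATE frontier;
        INSERT INTO frontier SELECT phone FROM next_frontier;
        DROP TABLE next_frontier;
    END LOOP;

    RETURN QUERY SELECT phone FROM seen WHERE phone <> start_phone;
END
$$;

-- usage: SELECT COUNT(*) FROM reachable_phones(1);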

Thanks in advance for reading such a big question 🙂

P.P.S. I used Postgres 12 with the following postgresql.conf:

# DB Version: 12
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 16 GB
# CPUs num: 8
# Data Storage: ssd

max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 5242kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4

Best Answer

Just by looking at the explain plan, we see that the optimizer's estimates are way off:

  • the sort node is estimated at 46170 rows whereas there are actually 637677 rows
  • the index scan is estimated at 2270852 rows whereas there are actually 1722262 rows

And so on.

So I think you'll need to VACUUM ANALYZE your connections table (friends in the plan is just an alias for it) before getting a new plan.

Maybe it will only help a little, but we need to get rid of this problem before trying to understand what's happening and how to optimize that query in your context.
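
For example (a sketch; whether autovacuum has already taken care of this depends on your setup):

-- refresh planner statistics (and clean up dead tuples) for the table
VACUUM ANALYZE connections;

-- check when statistics were last refreshed
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'connections';

After that, re-running EXPLAIN ANALYZE should show whether the row estimates for the recursive part get any closer to the actual counts.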