PostgreSQL most efficient way to reference multiple tables

database-designperformancepostgresqlpostgresql-11postgresql-9.6postgresql-performance

I am looking for the most efficient way to reference multiple tables in one table, when there is only one reference possible at a time. Which means tables A and B are referenced by table C, but both A and B cannot be referenced in a single row in C, and I choose which one I look into when I write my query.
I have thought about 4 ways of doing this :

  • Having a column type and a column FK so I can make joins like type = 'A' AND a.pk = c.a_fk (this is what we use right now)
  • Using inheritance : Both A and B inherit a sequence from a "parent" table and C has a foreign key to the "parent" table, but this is not possible with PostgreSQL
  • Using the same principle as the inheritance but with another table D. Both A and B have a reference to D, C also has a reference to D, and I can do a join like FROM c JOIN a ON a.d_fk = c.d_fk
  • Using a column by table I want to have a foreign key to

In every solution I tried, the query planner is wrong about how many rows will be returned.
Here is an example : I created a simple database where I have 3 tables like so :

                            Table "public.a"
 Column |  Type  | Collation | Nullable |            Default            
--------+--------+-----------+----------+-------------------------------
 pk     | bigint |           | not null | nextval('a_pk_seq'::regclass)

                            Table "public.b"
 Column |  Type  | Collation | Nullable |            Default            
--------+--------+-----------+----------+-------------------------------
 pk     | bigint |           | not null | nextval('b_pk_seq'::regclass)

                           Table "public.c"
 Column |  Type  | Collation | Nullable |            Default            
--------+--------+-----------+----------+-------------------------------
 pk     | bigint |           | not null | nextval('c_pk_seq'::regclass)
 a_fk   | bigint |           |          | 
 b_fk   | bigint |           |          | 
Indexes:
    "c_pkey" PRIMARY KEY, btree (pk)
    "c_a_fk_idx" btree (a_fk)
    "c_b_fk_idx" btree (b_fk)
Foreign-key constraints:
    "c_a_fk_fkey" FOREIGN KEY (a_fk) REFERENCES a(pk)
    "c_b_fk_fkey" FOREIGN KEY (b_fk) REFERENCES b(pk)
  • A is filled with 20 rows, B with 10000.
  • C is also filled with 10000, both a_fk and b_fk cannot be filled at the same time. All rows are filled with b_fk (which must be unique) except for 5 rows where a_fk is filled (which must be unique too).

So when I run this query :

SELECT * FROM a JOIN c ON a.pk = c.a_fk;

I would except the planner to think the query will return 5 rows.
Instead, this is what I got :

 Merge Join  (cost=1.76..1.94 rows=10000 width=32) (actual time=0.035..0.051 rows=5 loops=1)
   Merge Cond: (a.pk = c.a_fk)
   ->  Sort  (cost=1.63..1.68 rows=20 width=8) (actual time=0.024..0.028 rows=15 loops=1)
         Sort Key: a.pk
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on a  (cost=0.00..1.20 rows=20 width=8) (actual time=0.009..0.013 rows=20 loops=1)
   ->  Index Scan using c_a_fk_idx1 on c  (cost=0.13..24.21 rows=10000 width=24) (actual time=0.006..0.012 rows=5 loops=1)
 Planning time: 0.297 ms
 Execution time: 0.088 ms

I tried to create a partial index CREATE UNIQUE INDEX ON c (b_fk) WHERE a_fk IS NULL but it didn't change anything.

The only thing that improved the plan was to add a NULL check on b_fk like this :

SELECT * FROM a JOIN c ON a.pk = c.a_fk WHERE c.a_fk IS NOT NULL;

Which gives this plan :

 Merge Join  (cost=1.92..6.05 rows=5 width=32) (actual time=0.031..0.046 rows=5 loops=1)
   Merge Cond: (c.a_fk = a.pk)
   ->  Index Scan using c_a_fk_idx on c  (cost=0.29..20.37 rows=5 width=24) (actual time=0.005..0.012 rows=5 loops=1)
         Index Cond: (a_fk IS NOT NULL)
   ->  Sort  (cost=1.63..1.68 rows=20 width=8) (actual time=0.022..0.024 rows=15 loops=1)
         Sort Key: a.pk
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on a  (cost=0.00..1.20 rows=20 width=8) (actual time=0.006..0.007 rows=20 loops=1)
 Planning time: 0.559 ms
 Execution time: 0.086 ms

But it still gets it wrong for the rows in A and I think there must be another way to help the query planner, one that doesn't involve adding a "useless" NULL check for every column that should be NULL.

I tried on both Postgres 9.6 and 11, I would prefer a solution for 9.6 because this is what we use but I would accept a solution for 11.

Here is a real life example using the "type" solution :

SELECT *
  FROM BinDescription b
  JOIN RobotDescription r
    ON b.physicalContainerType = 'ROBOT'
   AND b.physicalContainerFK = r.robotPK
  JOIN RobotMissionDescription rm
    ON rm.robotFK = r.robotPK
;

Which gives me this plan :

 Merge Join  (cost=225.12..395.31 rows=12026 width=301) (actual time=4.372..67.206 rows=47293 loops=1)
   Merge Cond: (b.physicalcontainerfk = r.robotpk)
   ->  Nested Loop  (cost=222.91..15184.24 rows=88 width=152) (actual time=4.321..44.448 rows=47293 loops=1)
         ->  Index Scan using bindescription_physicalcontainertype_physicalcontainerfk_key on bindescription b  (cost=0.29..13.66 rows=4 width=76) (actual time=0.011..0.017 rows=4 loops=1)
               Index Cond: (physicalcontainertype = 'ROBOT'::physicalcontainertype)
         ->  Bitmap Heap Scan on robotmissiondescription rm  (cost=222.62..3672.39 rows=12026 width=76) (actual time=2.431..7.985 rows=11823 loops=4)
               Recheck Cond: (robotfk = b.physicalcontainerfk)
               Heap Blocks: exact=14246
               ->  Bitmap Index Scan on idx_robotmissiondescription_robotfk  (cost=0.00..219.62 rows=12026 width=0) (actual time=1.629..1.629 rows=11823 loops=4)
                     Index Cond: (robotfk = b.physicalcontainerfk)
   ->  Sort  (cost=2.20..2.29 rows=34 width=149) (actual time=0.044..4.752 rows=47321 loops=1)
         Sort Key: r.robotpk
         Sort Method: quicksort  Memory: 34kB
         ->  Seq Scan on robotdescription r  (cost=0.00..1.34 rows=34 width=149) (actual time=0.007..0.014 rows=34 loops=1)
 Planning time: 1.559 ms
 Execution time: 69.480 ms

Best Answer

Both A and B inherit a sequence from a "parent" table and C has a foreign key to the "parent" table, but this is not possible with PostgreSQL

This is possible with any SQL database. Why do you think it's not possible with PostgreSQL?

You're probably confusing SQL Table Inheritance with PostgreSQL's "table inheritance", which are different things

Here's how to do it in SQL. This is called Class Table Inheritance:

create table parties (
  party_id serial primary key
);

create table individuals (
  party_id int primary key references parties(party_id),
  given_name text,
  ...
);

create table organizations (
  party_id int primary key references parties(party_id),
  organization_name text,
  ...
);

create table sales_orders (
  order_id serial primary key,
  customer_id int references parties(party_id),
  ...
);

If you want faster performance, use Single Table Inheritance:

create table parties (
  party_id serial primary key,
  party_type text check(party_type in ('Individual','Organization')),
  given_name text null,
  organization_name text null,
  ...
  /* add some check constraints for party_type vs values for individuals/orgs */
);

Also, it'd be about 100x easier to help you if you gave real examples instead of abstract "A" and "B"