PostgreSQL – Upserting Data with Existing Many-to-Many Connections

postgresqlupsert

This is a strategy question – but one I've been grappling with all week and could really use some wisdom-as-a-service.

Context

I am trying to piece together a clean, actionable PostgreSQL database with sanitized records from a ton of messy, noisy raw source tables.

Say we have contacts, companies, stores. Many contacts can work at a company. Multiple companies can be involved in a store (owner, maintenance, etc.) There are many-to-many (M:N) junctions between contacts_companies and companies_stores.

Table relationships:

contacts >-< contacts_companies >-< companies >-< companies_stores >-< stores

unique on: contacts:email, companies:name, stores:formatted_address

The multiple data sources have all sorts of overlaps between the above tables. Some have just store info, some have some contact and store info, etc. As I go through each data source, data will be cleansed and then upserted to the appropriate table.

Problem

The upsert operation is the problem. Inserting with an ON CONFLICT is fine for a specific table, but what about the junctions between?

To illustrate the problem, say we have a data source that has the following records:

{owner: "Alice A.", email: "A@Alice_stores.net", store_address:"1 main st"}

This record has

  • a contact {name: Alice A, email: A@Alice_stores.net},
  • a company {name: Alice Stores}
  • a store {address: 1 main st}

The database might already have a store for 1 main st, but not a contact/company for Alice.

If we individually upsert the contact, company and store, we won't know what proper IDs to place in the junction tables:

  • If INSERT, use new id for FK in junction.
  • If Update, use existing id for fk in junction.

Question

What is the best strategy for upserting multiple tables from various sources when there are M:N relationships across the board?

One strategy that occurred to me is to query all tables/junctions, hold them in memory and perform all conditionals/updates in memory and then have a more forceful upsert. But this can quickly get out of hand…

Best Answer

One way could be using returning clause with cte to make all upserts in one statement. For illustration purposes I use 1 many-to-many relationship.

--- test tables with m2m 
create table test1(test1_id bigserial not null primary key, uq_value1 text ,constraint unique_uq_val1 unique (uq_value1));

create table test2(test2_id bigserial not null primary key, uq_value2 text ,constraint unique_uq_val2 unique (uq_value2));


create table test1_test2 (test1_id bigint not null, test2_id bigint not null, 
primary key (test1_id, test2_id),
constraint fk_test1  foreign key (test1_id) references test1(test1_id ),
constraint fk_test2  foreign key (test2_id) references test2(test2_id )
);
------------------------------------
-- insert value into the first table , or do a dummy update if it's already there
with 
ins1 as 
(insert  into test1(uq_value1) values('foo') on conflict on constraint unique_uq_val1 do update 
    set  uq_value1=test1.uq_value1
returning test1_id 
)
,
-- insert value into the second table , or do a dummy update if it's already there
ins2 as 
(insert  into test2(uq_value2) values('bar') on conflict on constraint unique_uq_val2  do update 
    set  uq_value2=test2.uq_value2 
returning test2_id 
)
,
-- select PK of inserted records (since there is exactly one record in each insert,
-- cross join is used
sel_1 as 
(
select test1_id,test2_id
from ins1  
cross join ins2 
)
-- finally insert into link tables if such a record doesn't exist : 
insert into test1_test2(test1_id, test2_id)
select * from sel_1  a 
where not exists( select null from test1_test2 b where (b.test1_id, b.test2_id) = (a.test1_id, a.test2_id))