PostgreSQL – Upserting Data with Existing Many-to-Many Connections

postgresqlupsert

This is a strategy question – but one I've been grappling with all week and could really use some wisdom-as-a-service.

Context

I am trying to piece together a clean, actionable PostgreSQL database with sanitized records from a ton of messy, noisy raw source tables.

Say we have contacts, companies, stores. Many contacts can work at a company. Multiple companies can be involved in a store (owner, maintenance, etc.) There are many-to-many (M:N) junctions between contacts_companies and companies_stores.

Table relationships:

contacts >-< contacts_companies >-< companies >-< companies_stores >-< stores

unique on: contacts:email, companies:name, stores:formatted_address

The multiple data sources have all sorts of overlaps between the above tables. Some have just store info, some have some contact and store info, etc. As I go through each data source, data will be cleansed and then upserted to the appropriate table.

Problem

The upsert operation is the problem. Inserting with an ON CONFLICT is fine for a specific table, but what about the junctions between?

To illustrate the problem, say we have a data source that has the following records:

{owner: "Alice A.", email: "A@Alice_stores.net", store_address:"1 main st"}

This record has

a contact {name: Alice A, email: A@Alice_stores.net},
a company {name: Alice Stores}
a store {address: 1 main st}

The database might already have a store for 1 main st, but not a contact/company for Alice.

If we individually upsert the contact, company and store, we won't know what proper IDs to place in the junction tables:

If INSERT, use new id for FK in junction.
If Update, use existing id for fk in junction.

Question

What is the best strategy for upserting multiple tables from various sources when there are M:N relationships across the board?

One strategy that occurred to me is to query all tables/junctions, hold them in memory and perform all conditionals/updates in memory and then have a more forceful upsert. But this can quickly get out of hand…

Best Answer

One way could be using returning clause with cte to make all upserts in one statement. For illustration purposes I use 1 many-to-many relationship.

--- test tables with m2m 
create table test1(test1_id bigserial not null primary key, uq_value1 text ,constraint unique_uq_val1 unique (uq_value1));

create table test2(test2_id bigserial not null primary key, uq_value2 text ,constraint unique_uq_val2 unique (uq_value2));


create table test1_test2 (test1_id bigint not null, test2_id bigint not null, 
primary key (test1_id, test2_id),
constraint fk_test1  foreign key (test1_id) references test1(test1_id ),
constraint fk_test2  foreign key (test2_id) references test2(test2_id )
);
------------------------------------
-- insert value into the first table , or do a dummy update if it's already there
with 
ins1 as 
(insert  into test1(uq_value1) values('foo') on conflict on constraint unique_uq_val1 do update 
    set  uq_value1=test1.uq_value1
returning test1_id 
)
,
-- insert value into the second table , or do a dummy update if it's already there
ins2 as 
(insert  into test2(uq_value2) values('bar') on conflict on constraint unique_uq_val2  do update 
    set  uq_value2=test2.uq_value2 
returning test2_id 
)
,
-- select PK of inserted records (since there is exactly one record in each insert,
-- cross join is used
sel_1 as 
(
select test1_id,test2_id
from ins1  
cross join ins2 
)
-- finally insert into link tables if such a record doesn't exist : 
insert into test1_test2(test1_id, test2_id)
select * from sel_1  a 
where not exists( select null from test1_test2 b where (b.test1_id, b.test2_id) = (a.test1_id, a.test2_id))

Related Solutions

Postgresql – How to design a database for continents, countries, regions, cities and POIs

i suggest 2 options

Option 1: if you want a RBDMs, i think 2 tables will be sufficient

Table 1 (Lookup)
-LocationID
-Continent
-country
-region
-City
-Landmark (granularity)

in this Table since its a lookup so i suggest keeping it Denormalized, it will save you alot in your coding specially when joining tables and searching and reporting. and maintenance is easier, then 5 separate tables

the other table is the table to collect the data

Table 2 (Master)
-Tourist
-LocationID
-ExtraFields
-...

so in this table you collect info about tourists and all you have to do is store the locationid, which is your granularity, this way the Locationid, has the Landmark,city,region,country,continent Since the Landmark is definetly in a city and the city is in a country and the country is in a region.. (no need for 5 tables to keep them separate, there will be a big chance of invalid inputs or data if not maintained properly)

Option 2:

i suggest graph database, check this post at SO , about using Neo4J for GIS and an implementation of GIS by Neo4J

PostgreSQL Upsert – Column Reference is Ambiguous When Upserting Element

From the docs,

conflict_action specifies an alternative ON CONFLICT action. It can be either DO NOTHING, or a DO UPDATE clause specifying the exact details of the UPDATE action to be performed in case of a conflict. The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the existing row using the table's name (or an alias), and to rows proposed for insertion using the special excluded table. SELECT privilege is required on any column in the target table where corresponding excluded columns are read.

So instead, try this per ypercubeᵀᴹ

INSERT INTO accounts (id, token, affiliate_code)
VALUES (value1, value2, value3)
ON CONFLICT (id) DO
UPDATE SET token = value2,
  affiliate_code = COALESCE(accounts.affiliate_code, excluded.affiliate_code);