This is a strategy question – but one I've been grappling with all week and could really use some wisdom-as-a-service.
Context
I am trying to piece together a clean, actionable PostgreSQL database with sanitized records from a ton of messy, noisy raw source tables.
Say we have contacts, companies, stores. Many contacts can work at a company. Multiple companies can be involved in a store (owner, maintenance, etc.) There are many-to-many (M:N) junctions between contacts_companies
and companies_stores
.
Table relationships:
contacts >-< contacts_companies >-< companies >-< companies_stores >-< stores
unique on: contacts:email, companies:name, stores:formatted_address
The multiple data sources have all sorts of overlaps between the above tables. Some have just store info, some have some contact and store info, etc. As I go through each data source, data will be cleansed and then upserted to the appropriate table.
Problem
The upsert operation is the problem. Inserting with an ON CONFLICT is fine for a specific table, but what about the junctions between?
To illustrate the problem, say we have a data source that has the following records:
{owner: "Alice A.", email: "A@Alice_stores.net", store_address:"1 main st"}
This record has
- a contact
{name: Alice A, email: A@Alice_stores.net}
, - a company
{name: Alice Stores}
- a store
{address: 1 main st}
The database might already have a store for 1 main st
, but not a contact/company for Alice
.
If we individually upsert the contact, company and store, we won't know what proper IDs to place in the junction tables:
- If INSERT, use new id for FK in junction.
- If Update, use existing id for fk in junction.
Question
What is the best strategy for upserting multiple tables from various sources when there are M:N relationships across the board?
One strategy that occurred to me is to query all tables/junctions, hold them in memory and perform all conditionals/updates in memory and then have a more forceful upsert. But this can quickly get out of hand…
Best Answer
One way could be using
returning
clause with cte to make all upserts in one statement. For illustration purposes I use 1 many-to-many relationship.