PostgreSQL – How to Modify Values on INSERT Based on Referenced Table Without Existing Foreign Key

bulk-insert, performance, postgresql-10, trigger

Summary

I want to add rows to a (projected) very large table, but swap two of the values for references to other tables. The catch is that the referenced tables should be updated automatically whenever new values arrive. I have a hunch about how to do it, but don't know how to execute it.

There's a lengthy explanation below, but feel free to skip to the step-by-step diagram, which is pretty self-explanatory. Please keep in mind that this should happen entirely in the database, without relying on the application.

Background

I'll try to describe the full problem as well as I can here, along with the reasoning for the specific issue I'm trying to solve. If I'm solving the wrong thing, please call me out.

Note: I'll be using PostgreSQL 10 for this, and even though it might not be the best tool, I'm restricted to it for this problem.

The core of this application is a table containing, essentially, readings, where the readings come from (origin) and which actor is responsible for these readings.

However, I don't have much control over how many or which origins and actors there are. I must assume that I'll be recording new values for these fields as they come. (Of course the application layer will have some control over the process, but assume I have no control over it)

What I do know, though, is that they are usually medium-sized strings (40–200 characters).

And this is basically the whole motivation of the problem: the only "useful" values in the core table will be the readings, which are much smaller in size, while the actor and origin are mostly annotations. They will be occasionally useful, so I can't just discard them, but it seems extremely wasteful, storage-wise, to store their full values for each row, not to mention fitting a lot fewer rows per page. The table is projected to contain on the order of tens of billions of rows, so this could be significant.

The application layer, however, shouldn't be aware of this optimization. As far as it's concerned, it should just INSERT INTO readings VALUES (reading, origin_name, actor_name);, sometimes one row at a time, sometimes multiple, either through an SQL client or an ORM.

Additional info

The number of distinct "origin" values should be 3 to 4 orders of magnitude smaller than the number of rows in the readings table, while "actors" should number no more than a few dozen.

Even if having separate tables is not necessary to solve my problem, I'd need a table for actors anyway, to record a couple of useful annotations.

Even though I said rows could be inserted one at a time, mostly they'll be inserted in batches through an ORM, on the order of thousands per query. This could be important in terms of overhead if the solution involves running the check-create-substitute-insert cycle for each row.

Proposed Solution

Database schema

CREATE TABLE origins (
  origin_id serial PRIMARY KEY,
  origin_name VARCHAR (255) NOT NULL
);

CREATE TABLE actors (
  actor_id serial PRIMARY KEY,
  actor_name VARCHAR (255) NOT NULL,
  actor_comments VARCHAR (255) NOT NULL
);

CREATE TABLE readings (
  id bigserial PRIMARY KEY,
  reading integer NOT NULL, -- Not a single value in actual table
  origin_id integer NOT NULL,
  actor_id integer NOT NULL,
  CONSTRAINT origin_id_fkey FOREIGN KEY (origin_id)
    REFERENCES origins (origin_id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT actor_id_fkey FOREIGN KEY (actor_id)
    REFERENCES actors (actor_id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE NO ACTION
);

Idea – Create trigger after INSERT statement on dump_readings table

Basically, I would create a support table into which the raw values are inserted, and trigger a function after each INSERT statement. It should get a unique set of both the origins and actors, add new values to their respective tables, return the reference ids, and swap the values in for them. Then just insert into the actual table and clear the contents of this support one.
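To make the idea concrete, here is a rough sketch of what that could look like in PostgreSQL 10, using a statement-level AFTER trigger with a transition table so each batch is processed in one pass. The dump_readings name and the plain NOT EXISTS checks are assumptions on my part; to be safe under concurrent inserts you would also want UNIQUE constraints on origin_name and actor_name plus ON CONFLICT DO NOTHING.

-- Staging table holding the raw values exactly as the application sends them.
CREATE TABLE dump_readings (
  reading integer NOT NULL,
  origin_name VARCHAR (255) NOT NULL,
  actor_name VARCHAR (255) NOT NULL
);

CREATE OR REPLACE FUNCTION move_dump_readings() RETURNS trigger AS $$
BEGIN
  -- Add any origins and actors we haven't seen before.
  INSERT INTO origins (origin_name)
  SELECT DISTINCT n.origin_name
  FROM new_rows n
  WHERE NOT EXISTS (SELECT 1 FROM origins o WHERE o.origin_name = n.origin_name);

  INSERT INTO actors (actor_name, actor_comments)
  SELECT DISTINCT n.actor_name, ''  -- actor_comments is NOT NULL, so use a placeholder
  FROM new_rows n
  WHERE NOT EXISTS (SELECT 1 FROM actors a WHERE a.actor_name = n.actor_name);

  -- Swap the strings for their ids and move the rows into the real table.
  INSERT INTO readings (reading, origin_id, actor_id)
  SELECT n.reading, o.origin_id, a.actor_id
  FROM new_rows n
  JOIN origins o ON o.origin_name = n.origin_name
  JOIN actors a ON a.actor_name = n.actor_name;

  -- Clear the staging table.
  DELETE FROM dump_readings;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Transition tables (REFERENCING NEW TABLE) are available from PostgreSQL 10,
-- so the trigger sees the whole inserted batch at once instead of row by row.
CREATE TRIGGER dump_readings_move
AFTER INSERT ON dump_readings
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE PROCEDURE move_dump_readings();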

I feel like this would solve the problem, but am concerned about potential race conditions and general inefficiency of the whole process. I suppose this could happen in a single query, maybe a trigger modifying the INSERT itself.

So, is this solution good enough or am I looking at it the wrong way?

What I expect

[Step-by-step diagrams: Step 1 through Step 5]

Best Answer

This is the most common use for INSTEAD OF triggers: creating updatable views.

Essentially you would:

  1. Create your base tables
  2. Create the view referencing those tables, presenting them as a single table
  3. Create INSTEAD OF triggers for insert/update/delete operations - these cause the default action (an error, because you can't update a view like that normally) to be ignored and instead do the work of modifying the base tables' data accordingly.

Anything accessing the table with standard CRUD operations (manual SQL statements, stored procedures, ORMs, ...) should be able to interact with the view as if it were a single table. Obviously it isn't one, which will have an effect on the complexity of some query plans, but for simple cases like this that should not be a problem if your indexing is optimal.

I'm mainly a SQL Server person, so I don't know the exact pg syntax for this, but search for "updatable views postgres" and you should find many examples, perhaps even in the official documentation.

"Then just insert into the actual table and clear the contents of this support one."

Because you can do this on views, there is no need for the extra "dummy" table.
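To make that concrete, here is a minimal sketch of what the view plus INSTEAD OF INSERT trigger could look like in PostgreSQL, based on the schema from the question. The names are illustrative and the look-up-or-create logic is simplified; note that INSTEAD OF triggers are row-level, so the per-row overhead mentioned in the question still applies to large batches.

-- The view presents the joined tables as one "wide" table.
CREATE VIEW readings_view AS
SELECT r.id, r.reading, o.origin_name, a.actor_name
FROM readings r
JOIN origins o USING (origin_id)
JOIN actors a USING (actor_id);

CREATE OR REPLACE FUNCTION readings_view_insert() RETURNS trigger AS $$
DECLARE
  v_origin_id integer;
  v_actor_id integer;
BEGIN
  -- Look up the origin, creating it if it doesn't exist yet.
  SELECT origin_id INTO v_origin_id FROM origins WHERE origin_name = NEW.origin_name;
  IF v_origin_id IS NULL THEN
    INSERT INTO origins (origin_name) VALUES (NEW.origin_name)
    RETURNING origin_id INTO v_origin_id;
  END IF;

  -- Same for the actor (actor_comments is NOT NULL, so use a placeholder).
  SELECT actor_id INTO v_actor_id FROM actors WHERE actor_name = NEW.actor_name;
  IF v_actor_id IS NULL THEN
    INSERT INTO actors (actor_name, actor_comments) VALUES (NEW.actor_name, '')
    RETURNING actor_id INTO v_actor_id;
  END IF;

  -- Store only the ids in the big table.
  INSERT INTO readings (reading, origin_id, actor_id)
  VALUES (NEW.reading, v_origin_id, v_actor_id);

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- INSTEAD OF triggers on views must be row-level.
CREATE TRIGGER readings_view_insert_trg
INSTEAD OF INSERT ON readings_view
FOR EACH ROW
EXECUTE PROCEDURE readings_view_insert();

The application would then simply run INSERT INTO readings_view (reading, origin_name, actor_name) VALUES (...); as if the view were the original wide table.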