PostgreSQL – How to Modify Values on INSERT Based on Referenced Table Without Existing Foreign Key

bulk-insert, performance, postgresql-10, trigger

Summary

I want to add rows to a (projected) very large table, but swap two of the values for references to other tables. The catch is that the referenced tables should be updated automatically whenever new values arrive. I have a hunch about how to do it, but don't know how to execute it.

There's a lengthy explanation below, but feel free to skip to the step-by-step diagram, which is pretty self-explanatory. Please keep in mind that this should happen entirely in the database, without relying on the application.

Background

I'll try to describe the full problem as well as I can here, along with the reasoning for the specific issue I'm trying to solve. If I'm solving the wrong thing, please call me out.

Note: I'll be using PostgreSQL 10 for this, and even though it might not be the best tool, I'm restricted to it for this problem.

The core of this application is a table containing, essentially, readings, where the readings come from (origin) and which actor is responsible for these readings.

However, I don't have much control over how many or which origins and actors there are. I must assume that I'll be recording new values for these fields as they come. (Of course the application layer will have some control over the process, but assume I have no control over it)

What I do know, though, is that they are usually medium-sized strings (40–200 characters).

And this is basically the whole motivation of the problem: the only "useful" values in the core table will be the readings, which are much smaller in size, while the actor and origin are mostly annotations. They will be occasionally useful, so I can't just discard them, but it seems extremely wasteful, storage-wise, to store their full values for each row, not to mention fitting a lot fewer rows per page. The table is projected to contain on the order of tens of billions of rows, so this could be significant.

The application layer, however, shouldn't be aware of this optimization. As far as it's concerned, it should just INSERT INTO readings VALUES (reading, origin_name, actor_name);, sometimes one row at a time, sometimes multiple, either through an SQL client or an ORM.

Additional info

The number of distinct "origin" values should be 3 to 4 orders of magnitude smaller than the number of rows in the readings table, while "actors" should number no more than a few dozen.

Even if having separate tables is not necessary to solve my problem, I'd need a table for actors anyway, to record a couple of useful annotations.

Even though I said rows could be inserted one at a time, mostly they'll be inserted in batches through an ORM, on the order of thousands per query. This could be important in terms of overhead if the solution involves running the check-create-substitute-insert cycle for each row.

Proposed Solution

Database schema

CREATE TABLE origins (
  origin_id serial PRIMARY KEY,
  origin_name VARCHAR (255) NOT NULL
);

CREATE TABLE actors (
  actor_id serial PRIMARY KEY,
  actor_name VARCHAR (255) NOT NULL,
  actor_comments VARCHAR (255) NOT NULL
);

CREATE TABLE readings (
  id bigserial PRIMARY KEY,
  reading integer NOT NULL, -- Not a single value in actual table
  origin_id integer NOT NULL,
  actor_id integer NOT NULL,
  CONSTRAINT origin_id_fkey FOREIGN KEY (origin_id)
    REFERENCES origins (origin_id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT actor_id_fkey FOREIGN KEY (actor_id)
    REFERENCES actors (actor_id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE NO ACTION
);

Idea – Create trigger after INSERT statement on dump_readings table

Basically, I would create a support table into which the raw values are inserted, and trigger a function after each INSERT statement. It should get a unique set of both the origins and actors, add new values to their respective tables, return the reference ids, and swap the values in for them. Then just insert into the actual table and clear the contents of this support one.
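To make the idea concrete, here is a rough sketch of what that could look like in PostgreSQL 10, using a statement-level AFTER trigger with a transition table so each batch is processed in one pass. The dump_readings name and the plain NOT EXISTS checks are assumptions on my part; to be safe under concurrent inserts you would also want UNIQUE constraints on origin_name and actor_name plus ON CONFLICT DO NOTHING.

-- Staging table holding the raw values exactly as the application sends them.
CREATE TABLE dump_readings (
  reading integer NOT NULL,
  origin_name VARCHAR (255) NOT NULL,
  actor_name VARCHAR (255) NOT NULL
);

CREATE OR REPLACE FUNCTION move_dump_readings() RETURNS trigger AS $$
BEGIN
  -- Add any origins and actors we haven't seen before.
  INSERT INTO origins (origin_name)
  SELECT DISTINCT n.origin_name
  FROM new_rows n
  WHERE NOT EXISTS (SELECT 1 FROM origins o WHERE o.origin_name = n.origin_name);

  INSERT INTO actors (actor_name, actor_comments)
  SELECT DISTINCT n.actor_name, ''  -- actor_comments is NOT NULL, so use a placeholder
  FROM new_rows n
  WHERE NOT EXISTS (SELECT 1 FROM actors a WHERE a.actor_name = n.actor_name);

  -- Swap the strings for their ids and move the rows into the real table.
  INSERT INTO readings (reading, origin_id, actor_id)
  SELECT n.reading, o.origin_id, a.actor_id
  FROM new_rows n
  JOIN origins o ON o.origin_name = n.origin_name
  JOIN actors a ON a.actor_name = n.actor_name;

  -- Clear the staging table.
  DELETE FROM dump_readings;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Transition tables (REFERENCING NEW TABLE) are available from PostgreSQL 10,
-- so the trigger sees the whole inserted batch at once instead of row by row.
CREATE TRIGGER dump_readings_move
AFTER INSERT ON dump_readings
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE PROCEDURE move_dump_readings();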

I feel like this would solve the problem, but am concerned about potential race conditions and general inefficiency of the whole process. I suppose this could happen in a single query, maybe a trigger modifying the INSERT itself.

So, is this solution good enough or am I looking at it the wrong way?

What I expect

[Step-by-step diagrams: Step 1 through Step 5]

Best Answer

This is the most common use for INSTEAD OF triggers: creating updatable views.

Essentially you would:

  1. Create your base tables
  2. Create the view referencing those tables, presenting them as a single table
  3. Create INSTEAD OF triggers for insert/update/delete operations - these cause the default action (an error, because you can't update a view like that normally) to be ignored and instead do the work of modifying the base tables' data accordingly.

Anything accessing the table with standard CRUD operations (manual SQL statements, stored procedures, ORMs, ...) should be able to interact with the view as if it were a single table. Obviously it isn't one, which will have an effect on the complexity of some query plans, but for simple cases like this that should not be a problem if your indexing is optimal.

I'm mainly a SQL Server person, so I don't know the exact pg syntax for this, but search for "updatable views postgres" and you should find many examples, perhaps even in the official documentation.

"Then just insert into the actual table and clear the contents of this support one."

Because you can do this on views, there is no need for the extra "dummy" table.
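To make that concrete, here is a minimal sketch of what the view plus INSTEAD OF INSERT trigger could look like in PostgreSQL, based on the schema from the question. The names are illustrative and the look-up-or-create logic is simplified; note that INSTEAD OF triggers are row-level, so the per-row overhead mentioned in the question still applies to large batches.

-- The view presents the joined tables as one "wide" table.
CREATE VIEW readings_view AS
SELECT r.id, r.reading, o.origin_name, a.actor_name
FROM readings r
JOIN origins o USING (origin_id)
JOIN actors a USING (actor_id);

CREATE OR REPLACE FUNCTION readings_view_insert() RETURNS trigger AS $$
DECLARE
  v_origin_id integer;
  v_actor_id integer;
BEGIN
  -- Look up the origin, creating it if it doesn't exist yet.
  SELECT origin_id INTO v_origin_id FROM origins WHERE origin_name = NEW.origin_name;
  IF v_origin_id IS NULL THEN
    INSERT INTO origins (origin_name) VALUES (NEW.origin_name)
    RETURNING origin_id INTO v_origin_id;
  END IF;

  -- Same for the actor (actor_comments is NOT NULL, so use a placeholder).
  SELECT actor_id INTO v_actor_id FROM actors WHERE actor_name = NEW.actor_name;
  IF v_actor_id IS NULL THEN
    INSERT INTO actors (actor_name, actor_comments) VALUES (NEW.actor_name, '')
    RETURNING actor_id INTO v_actor_id;
  END IF;

  -- Store only the ids in the big table.
  INSERT INTO readings (reading, origin_id, actor_id)
  VALUES (NEW.reading, v_origin_id, v_actor_id);

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- INSTEAD OF triggers on views must be row-level.
CREATE TRIGGER readings_view_insert_trg
INSTEAD OF INSERT ON readings_view
FOR EACH ROW
EXECUTE PROCEDURE readings_view_insert();

The application would then simply run INSERT INTO readings_view (reading, origin_name, actor_name) VALUES (...); as if the view were the original wide table.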