Postgresql – How to obtain the last inserted row with INSERT … SELECT in PostgreSQL

insertpostgresqlselect-into

I'm trying to batch-copy data in a specific order from one table to another in PostgreSQL 12-beta2. The table is not using sequences, but contains an composite unique Primary Key (user_id, object_id).

In order to determine where to start for the next batch, I'd like to start off from the last inserted row (WHERE user_id >= last_user_id AND object_id > last_object_id).

Starting off with this:

INSERT INTO dest_table
SELECT (user_id, object_id, object_type, colN, ...) 
FROM source_table 
ORDER BY user_id, colN, object_id  -- this is indexed
LIMIT 1000  -- batch size
RETURNING user_id, object_id;

… returns a table of 1000 tuples. I'd like to obtain the last inserted tuple from it.

I've tried to do a SELECT around it, like this:

SELECT user_id, object_id FROM (
    INSERT INTO dest_table
    SELECT (user_id, object_id, object_type, colN, ...) 
    FROM source_table 
    ORDER BY user_id, colN, object_id  -- this is indexed
    LIMIT 1000  -- batch size
    RETURNING user_id, object_id
)
ORDER BY user_id DESC, colN DESC, object_id DESC
LIMIT 1
RETURNING user_id, object_id;

But that returns a syntax error:

ERROR:  syntax error at or near "INTO"
LINE 2: INSERT INTO dest_table
               ^

I've also attempted RETURNING ... INTO variable as described here, but that fails too:

ERROR:  syntax error at or near "INTO"
LINE 23: RETURNING user_id, object_id INTO my_variable;
                                      ^

Do I need to create a function for this (e.g. plpgsql) or am I missing something obvious in plain SQL that let me do this? That would be highly favorable.

Best Answer

It seems that you need only the maximum of the last inserted (user_id,object_id) as a couple. The first INSERT could be:

WITH inserted_rows as (
INSERT INTO dest_table
SELECT user_id, object_id, object_type, colN, ...
FROM source_table 
ORDER BY user_id, colN, object_id  -- this is indexed
LIMIT 1000  -- batch size
RETURNING user_id, object_id
)
SELECT row(user_id,object_id) FROM inserted_rows
 ORDER BY 1 DESC LIMIT 1;

This form uses a row to refer to the (user_id,object_id) couple for convenience. Row elements are compared left-to-right as needed in your case to use this couple of values as a boundary. See Row Constructor Comparison in the documentation.

With this solution, the subsequent INSERTs would inject the value returned by the last preceding insert. This avoids looking up the last inserted row in dest_table.

WITH inserted_rows as (
INSERT INTO dest_table
SELECT user_id, object_id, object_type, colN, ...
FROM source_table
  WHERE row(user_id,object_id) > row(...inject the result of the previous query here...)
ORDER BY user_id, colN, object_id
LIMIT 1000
RETURNING user_id, object_id
)
SELECT row(user_id,object_id) FROM inserted_rows
ORDER BY 1 DESC LIMIT 1;

Do I need to create a function for this (e.g. plpgsql) or am I missing something obvious in plain SQL that let me do this? That would be highly favorable.

In plain SQL, you would do a single INSERT in one step without any batching. Since you want to break it into batches, there has to be some loop driving these INSERTs with a stop condition, and this procedural logic goes beyond pure SQL.

If you don't feel like reinjecting the boundary through a variable, it could be persisted in a dedicated, single-row table.