PostgreSQL – Duplicate Row with Primary Key

Tags: default-value, dynamic-sql, postgresql, postgresql-9.4

Assume I have a table as follows named people, where id is a Primary Key:

+-----------+---------+---------+
|  id       |  fname  |  lname  |
| (integer) | (text)  | (text)  |
+===========+=========+=========+
|  1        | Daniel  | Edwards |
|  2        | Fred    | Holt    |
|  3        | Henry   | Smith   |
+-----------+---------+---------+

I'm trying to write a row duplication query which is robust enough to account for schema changes to the table. Any time I add a column to the table, I don't want to have to go back and modify the duplication query.

I know I can do this, which will duplicate record id 2 and give the duplicated record a new id:

INSERT INTO people (fname, lname) SELECT fname, lname FROM people WHERE id = 2;

However if I add an age column, I'll need to modify the query to also account for the age column.

Obviously I can't do the following, because it would also duplicate the primary key, resulting in a "duplicate key value violates unique constraint" error. And I don't want them to share the same id anyway:

INSERT INTO people SELECT * FROM people WHERE id = 2

With that said, what would be a reasonable approach to solving this challenge? I would prefer to stay away from stored procedures, but I'm not 100% against them, I suppose …

Best Answer

Simple with `hstore`

If you have the additional module hstore installed (instructions in link below), there is a surprisingly simple way to replace the value(s) of individual field(s) without knowing anything about other columns:

Basic example: duplicate the row with id = 2 but replace 2 with 3:

INSERT INTO people
SELECT (p #= hstore('id', '3')).* FROM people p WHERE id = 2;
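To preview what the #= operator produces before inserting anything, a sketch like the following may help. It assumes sufficient privileges to install hstore in the current database. Note that in the sample table from the question, id 3 is already taken by Henry Smith, so the basic INSERT would raise a unique violation there; the SELECT alone is safe to run.

```sql
-- Install the module once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS hstore;

-- Preview only: returns the row for id = 2 with id replaced by 3.
-- Nothing is written to the table.
SELECT (p #= hstore('id', '3')).*
FROM   people p
WHERE  id = 2;
```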

Details:

Assuming (since it's not defined in the question) that people.id is a serial column with an attached sequence, you'll want the next value from the sequence. We can determine the sequence name with pg_get_serial_sequence(). Details:

PostgreSQL SELECT primary key as "serial" or "bigserial"

Or you can just hard-code the sequence name if it's not going to change.
We would have this query:

~~INSERT INTO people SELECT (p #= hstore('id', nextval(pg_get_serial_sequence('people', 'id'))::text)).* FROM people p WHERE id = 2;~~

Which works, but suffers from a weakness in the Postgres query planner: the expression is evaluated separately for every single column in the row, wasting sequence numbers and performance. To avoid this, move the expression into a subquery and decompose the row only once:

INSERT INTO people
SELECT (p1).*
FROM  (
   SELECT p #= hstore('id', nextval(pg_get_serial_sequence('people', 'id'))::text) AS p1
   FROM   people p WHERE id = 2
   ) sub;

Probably fastest for a single (or few) row(s) at once.
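If the sequence name is hard-coded, as mentioned above, the query might look like the sketch below. people_id_seq is only the default name Postgres gives to the sequence behind a serial column id on a table named people, so it is an assumption here. The RETURNING clause is an addition, not part of the original query, to report the id assigned to the copy.

```sql
INSERT INTO people
SELECT (p1).*
FROM  (
   -- 'people_id_seq' is assumed: the default sequence name for serial column people.id
   SELECT p #= hstore('id', nextval('people_id_seq')::text) AS p1
   FROM   people p
   WHERE  id = 2
   ) sub
RETURNING id;  -- hands back the id assigned to the copy
```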

json / jsonb

If you don't have hstore installed and can't install additional modules, you can do a similar trick with json_populate_record() or jsonb_populate_record(), but that capability is undocumented and may be unreliable.

How to set value of composite variable field using dynamic SQL
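A sketch of the JSON variant, assuming Postgres 9.4 or later for json_build_object(), and assuming that fields missing from the JSON value keep the values of the base row, which is the behavior this trick relies on. It uses the same subquery form as the hstore query so that nextval() is called only once per row.

```sql
INSERT INTO people
SELECT (p1).*
FROM  (
   -- Start from row p and overwrite only the "id" field from the JSON object.
   SELECT json_populate_record(
             p, json_build_object('id', nextval(pg_get_serial_sequence('people', 'id')))
          ) AS p1
   FROM   people p
   WHERE  id = 2
   ) sub;
```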

Transient temporary table

Another simple solution would be to use a transient temporary table like this:

BEGIN;
CREATE TEMP TABLE people_tmp ON COMMIT DROP AS
SELECT * FROM people WHERE id = 2;
UPDATE people_tmp SET id = nextval(pg_get_serial_sequence('people', 'id'));
INSERT INTO people TABLE people_tmp;
COMMIT;

I added ON COMMIT DROP to drop the table automatically at the end of the transaction. Consequently, I also wrapped the operation into a transaction of its own. Neither is strictly necessary.

This offers a wide range of additional options - you can do anything with the row before inserting, but it's going to be a bit slower due to the overhead of creating and dropping a temp table.

This solution works for a single row or for any number of rows at once. Each row gets a new default value from the sequence automatically.

The INSERT uses the short (SQL standard) notation TABLE people_tmp, which is equivalent to SELECT * FROM people_tmp.
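To illustrate the multi-row case and the "do anything with the row before inserting" point, here is a sketch that copies two rows and also changes another column on the way. The id list and the ' (copy)' suffix are made up for the example.

```sql
BEGIN;
CREATE TEMP TABLE people_tmp ON COMMIT DROP AS
SELECT * FROM people WHERE id IN (1, 2);   -- copy several rows at once

UPDATE people_tmp
SET    id    = nextval(pg_get_serial_sequence('people', 'id'))  -- new id per row
     , lname = lname || ' (copy)';                              -- any other change

INSERT INTO people TABLE people_tmp;
COMMIT;
```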

Dynamic SQL

For many rows at once, dynamic SQL is going to be fastest. Concatenate the columns from the system table pg_attribute or from the information schema and execute it dynamically in a DO statement or write a function for repeated use:

CREATE OR REPLACE FUNCTION f_row_copy(_tbl regclass, _id int, OUT row_ct int) AS
$func$
BEGIN
   EXECUTE (
      SELECT format('INSERT INTO %1$s(%2$s) SELECT %2$s FROM %1$s WHERE id = $1',
                    _tbl, string_agg(quote_ident(attname), ', '))
      FROM   pg_attribute
      WHERE  attrelid = _tbl
      AND    NOT attisdropped  -- no dropped (dead) columns
      AND    attnum > 0        -- no system columns
      AND    attname <> 'id'   -- exclude id column
      )
   USING _id;

   GET DIAGNOSTICS row_ct = ROW_COUNT;  -- directly assign OUT parameter
END
$func$  LANGUAGE plpgsql;

Call:

SELECT f_row_copy('people', 9);

Works for any table with an integer column named id. You could easily make the column name dynamic, too ...
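One way the column name could be made dynamic is sketched below. The function name f_row_copy_by and the parameter _col are hypothetical, and the key column is still assumed to be an integer whose default (e.g. from a sequence) fills in the new value.

```sql
CREATE OR REPLACE FUNCTION f_row_copy_by(_tbl regclass, _col name, _id int, OUT row_ct int) AS
$func$
BEGIN
   EXECUTE (
      -- %1$s = table, %2$s = column list without the key column, %3$I = key column (quoted as identifier)
      SELECT format('INSERT INTO %1$s(%2$s) SELECT %2$s FROM %1$s WHERE %3$I = $1',
                    _tbl, string_agg(quote_ident(attname), ', '), _col)
      FROM   pg_attribute
      WHERE  attrelid = _tbl
      AND    NOT attisdropped  -- no dropped (dead) columns
      AND    attnum > 0        -- no system columns
      AND    attname <> _col   -- exclude the key column so it gets its default
      )
   USING _id;

   GET DIAGNOSTICS row_ct = ROW_COUNT;
END
$func$  LANGUAGE plpgsql;

-- Call, naming the key column explicitly:
SELECT f_row_copy_by('people', 'id', 2);
```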

Maybe not your first choice since you wanted to stay away from stored procedures, but then again, it's not a "stored procedure" anyway ...

Advanced solution

A serial column is a special case. If you want to fill more or all columns with their respective default values, it gets more sophisticated. Consider this related answer:

Generate DEFAULT values in a CTE UPSERT using PostgreSQL 9.3

Related Solutions

Postgresql table with one integer column, sorted index, with duplicate primary key

I think you're asking how to implement a solution you've already decided on for a more general problem you don't describe. If you were to outline the actual problem that this is supposed to solve, you might get better suggestions about how to solve it.

Working within the very limited information provided:

Update: I found your other question, which you really should've linked to. You seem to be trying to roll your own message queue. Don't do that. Read these:

Have I convinced you that you shouldn't try to do this yourself yet? Look into:

RabbitMQ
ZeroMQ
Job::Machine
ActiveMQ
PGQ
Celery

Some of what you want isn't available in current PostgreSQL versions. For example:

INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between two other rows, ordered by ID.

That'd require an index-organized table, which PostgreSQL doesn't have yet. The closest you'll get would be a one-column table with a PRIMARY KEY. With regular VACUUM on PostgreSQL 9.2 you'd be able to use index-only scans to access it most of the time.

As for allowing duplicates, you don't really seem to want to permit them at all, you're just saying you want to work around concurrency issues by temporarily permitting them.

You can remove such duplicates during INSERT so the table itself doesn't need to permit them. However, that'll cause issues with:

INSERTs will happen in bulk (about 1000 per transaction) and must not fail, except for disc full, etc. There must not be any chance for deadlocks.

... assuming that those inserts occur concurrently from multiple transactions. You'll have races between the checks for existence and the insert that can cause insert batches to fail and have to be re-tried.

I suspect that your best bet is to have a one-column table without a PRIMARY KEY. Just create an ordinary b-tree index on it, and leave the table without a PRIMARY KEY. Since it genuinely has no primary key (the only column may have duplicates) this is entirely reasonable.
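A minimal sketch of that layout follows. The table and index names are hypothetical, and the bigint type and NOT NULL constraint are assumptions, since the original question's schema isn't shown.

```sql
-- Hypothetical names: a single-column table that allows duplicates.
CREATE TABLE id_queue (
   id bigint NOT NULL   -- no PRIMARY KEY, no UNIQUE constraint
);

-- Ordinary (non-unique) b-tree index for ordered lookups.
CREATE INDEX id_queue_id_idx ON id_queue (id);
```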

(BTW, given that SQL is supposedly all about sets, it astounds me how awful it is at "add this entry to the set if not already present").

PostgreSQL – Insert Distinct Values from One Table into Another with Constraints

There are 3 possible kinds of duplicates:

1. Duplicates within the rows of the bulk insert.
2. Duplicates between inserted rows and existing rows.
3. Duplicates between inserted rows and concurrently inserted / updated rows from other transactions.

Just like I explained in this closely related answer:

Using EXCEPTION to ignore duplicates during bulk inserts

But things have become easier for 2. and 3. since Postgres 9.5 introduced UPSERT (INSERT .. ON CONFLICT DO NOTHING).

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp
ON CONFLICT (email) DO NOTHING;

If your duplicates only stem from duplicate entries in the source (1.), like you indicated, then all you need is DISTINCT. Works in any version of Postgres:

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp;
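For versions before 9.5, duplicates against existing rows can also be filtered in the SELECT itself. This sketch is not from the answer above; it covers duplicates within the source and against rows already in emails, but it does not protect against rows inserted concurrently by other transactions.

```sql
INSERT INTO emails(tag, email)
SELECT DISTINCT 655, t.email
FROM   emails_temp t
WHERE  NOT EXISTS (            -- skip emails that are already present
   SELECT 1 FROM emails e WHERE e.email = t.email
   );
```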