Postgresql – Postgres: Insert distinct values from one table into another table with constrains in place

constraintdistinctpostgresql

I'm trying to insert data from one PostgreSQL 9.3 table to another while removing duplicate entries.

I have two tables containing email addresses.

The main table contains emails and a tag for each email address. The combination (email, tag) must be unique, and for this there is a constraint Unique(email, tag) on it.

The second table get created on the fly from text files which contain only the email address. There are many duplicates there.

I need to import the data from the temp table into the main one without breaking the constraint above. For a certain file containing email addresses the tag is constant.

Table structures:

CREATE TABLE emails (   
  email character varying(128),
  tag bigint,
  CONSTRAINT "unique-tag-email" UNIQUE (email, tag) )

and

CREATE TABLE emails_temp (email character varying(128)

Here is my query:

insert into emails(tag,email) 
select 
 655,t.email 
from 
 emails_temp as t 
where 
 not exists ( select email from emails where email = t.email )

Note: 655 is just a tag of a certain group of email addresses.

This is the error I get:

ERROR: duplicate key value violates unique constraint
"unique-tag-email" SQL state: 23505 Detail: Key (email,
tag)=(user@hotmail.com, 655) already exists.

There are indeed, two email addresses user@hotmail.com in the file.

Useless to say, because of this error, there is nothing added in the main table (email).

What am I doing wrong?

Best Answer

There are 3 possible kinds of duplicates:

Duplicates within the rows of the bulk insert.
Duplicates between inserted rows and existing rows.
Duplicates between inserted rows and concurrently inserted / updated rows from other transactions.

Just like I explained in this closely related answer:

Using EXCEPTION to ignore duplicates during bulk inserts

But things have become easier for 2. and 3. since Postgres 9.5 introduced UPSERT (INSERT .. ON CONFLICT DO NOTHING).

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp
ON CONFLICT (email) DO NOTHING;

If your duplicates only stem from duplicate entries in the source (1.), like you indicated, then all you need is DISTINCT. Works in any version of Postgres:

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp;

Related Solutions

PostgreSQL – insert/update violates foreign key constraints

There are a few problems with your tables. I'll try to address the foreign keys first, since you question asked about them :)

But before that, we should realize that the two sets of tables (the first three you created and the second set, which you created after dropping the first set) are the same. Of course, the definition of Table3 in your second attempt has syntax and logical errors, but the basic idea is:

CREATE TABLE table3 (   
  "ID" bigint NOT NULL DEFAULT '0',   
  "DataID" bigint DEFAULT NULL,   
  "Address" numeric(20) DEFAULT NULL,   
  "Data" bigint DEFAULT NULL,
   PRIMARY KEY ("ID"),   
   FOREIGN KEY ("DataID") REFERENCES Table1("DataID") on delete cascade on update cascade,   
   FOREIGN KEY ("Address") REFERENCES Table2("Address") on delete cascade on update cascade
);

This definition tell PostgreSQL roughly the following: "Create a table with four columns, one will be the primary key (PK), the others can be NULL. If a new row is inserted, check DataID and Address: if they contain a non-NULL value (say 27856), then check Table1 for DataID˙and Table2 for Address. If there is no such value in those tables, then return an error." This last point which you've seen first:

ERROR: insert or update on table "Table3" violates foreign key constraint 
    "Table3_DataID_fkey" DETAIL: Key (DataID)=(27856) is not present in table "Table1".

So simple: if there is no row in Table1 where DataID = 27856, then you can't insert that row into Table3.

If you need that row, you should first insert a row into Table1 with DataID = 27856, and only then try to insert into Table3. If this seems to you not what you want, please describe in a few sentences what you want to achieve, and we can help with a good design.

And now about the other problems.

You define your PKs as

CREATE all_your_tables (
    first_column NOT NULL DEFAULT '0',   
    [...]
    PRIMARY KEY ("ID"),

A primary key means that all the items in it are different from each other, that is, the values are UNIQUE. If you give a static DEFAULT (like '0') to a UNIQUE column, you will experience bad surprises all the time. This is what you got in your third error message.

Furthermore, '0' means a text string, but not a number (bigint or numeric in your case). Use simply 0 instead (or don't use it at all, as I written above).

And a last point (I may be wrong here): in Table2, your Address field is set to numeric(20). At the same time, it is the PK of the table. The column name and the data type suggests that this address can change in the future. If this is true, than it is a very bad choice for a PK. Think about the following scenario: you have an address '1234567890454', which has a child in Table3 like

ID        DataID           Address             Data
123       3216547          1234567890454       654897564134569

Now that address happens to change to something other. How do you make your child row in Table3 follow its parent to the new address? (There are solutions for this, but can cause much confusion.) If this is your case, add an ID column to your table, which will not contain any information from the real world, it will simply serve as an identification value (that is, ID) for an address.

Postgresql – Unique across tables

To enforce unique email addresses, I would remove all competing email columns and store them in one central email table for all active emails. And another table for deleted emails:

CREATE TABLE users (
  user_id  serial PRIMARY KEY
, username text UNIQUE NOT NULL
, email    text UNIQUE -- FK added below  -- can also be NOT NULL
);

CREATE TABLE email (
  email    text PRIMARY KEY
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
, UNIQUE (user_id, email)  -- seems redundant, but required for FK
);

ALTER TABLE users ADD CONSTRAINT users_primary_email_fkey
FOREIGN KEY (user_id, email) REFERENCES email (user_id, email);

CREATE TABLE email_deleted (
  email_id serial PRIMARY KEY
, email    text NOT NULL  -- not necessarily unique
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
);

This way:

Active emails are unique, enforced by the PK constraint of email.
Each user can have any number of active and deleted emails, but ...
Each user can only have one primary email.
Every email is always owned by one user and is deleted with the user.
To soft-delete an email (without losing it and its affiliation to its user, move the row from email to email_deleted.
- The primary email of a user cannot be deleted this way, because the primary email must not be deleted.
I designed the FK constraint users_primary_email_fkey to span (user_id, email), which seems redundant at first. But this way the primary email can only be an email that is actually owned by the same user.
Due to the default MATCH SIMPLE behavior of FK constraints, you can still enter a user without primary email, because the FK constraint is not enforced if any of the columns is null.
Details:
- Two-column foreign key constraint only when third column is NOT NULL

The UNIQUE constraint on users.email is redundant for this solution, but it may be useful for other reasons. The automatically created index should come in handy (for instance for the last query in this answer).

The only thing that's not enforced this way is that every user has a primary email. You can do this, too. Add NOT NULL constraint to users.email

UNIQUE (user_id, email) is required for the FK constraint:

How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

You have doubtless spotted the circular reference in the above model. Contrary to what one might expect, this just works.

As long as users.email can be NULL, it's trivial:

INSERT user without email.
INSERT email referencing the owning user_id.
UPDATE user to set it's primary email if applicable.

It even works with users.email set to NOT NULL. You have to insert user and email at the same time though:

WITH u AS (
   INSERT INTO users(username, email)
   VALUES ('user_foo', 'foo@mail.com')
   RETURNING email, user_id
   )
INSERT INTO email (email, user_id)
SELECT email, user_id
FROM   u;

IMMEDIATE FK constraints (the default) are checked at the end of each statement. The above is one statement. That's why it works where two separate statements would fail. Detailed explanation:

How to deal with mutually recursive inserts

To get all emails of a user as array, with the primary email first:

SELECT u.*, e.emails
FROM   users u
     , LATERAL (
      SELECT ARRAY (
      SELECT email
      FROM   email
      WHERE  user_id = u.user_id
      ORDER  BY (email <> u.email)  -- sort primary email first
      ) AS emails
   ) e
WHERE  user_id = 1;

You could create a VIEW with this for ease of use.
LATERAL requires Postgres 9.3. use a correlated subquery in pg 9.2:

SELECT *, ARRAY (
             SELECT email
             FROM   email
             WHERE  user_id = u.user_id
             ORDER  BY (email <> u.email)  -- sort primary email first
             ) AS emails
FROM   users u
WHERE  user_id = 1;

To soft-delete an email:

WITH del AS (
   DELETE FROM email
   WHERE  email = 'spam@mail.com'
   RETURNING email, user_id
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM del;

To soft-delete the primary email of a given user:

WITH upd AS (
   UPDATE users u
   SET    email = NULL
   FROM   (SELECT user_id, email FROM users WHERE user_id = 123 FOR UPDATE) old
   WHERE  old.user_id = u.user_id
   AND    u.user_id = 1
   RETURNING old.*
   )
,    del AS (
   DELETE FROM email
   USING  upd
   WHERE  email.email = upd.email
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM upd;

Details:

Return pre-UPDATE Column Values Using SQL Only - PostgreSQL Version

Quick test for all of the above: SQL Fiddle.

Best Answer

Related Solutions

PostgreSQL – insert/update violates foreign key constraints

Postgresql – Unique across tables

Related Question