PostgreSQL Unique Constraint – Remove Extra Email Characters to Enforce Uniqueness

constraintpostgresqlunique-constraint

Gmail ignores . inside an email address e.g abc@gmail.com and a.b.c@gmail.com are same
Gmail also considers same account if account has any number after + symbol. e.g xyz@gmail and xyz+1@gmail.com are same

I want to put constraint for uniqueness on emails which can handle above case in postgres but not successful so far. Simple unique and saving in lower case doesn't work

Best Answer

To answer your question (+1 for an interesting, challenging and relevant question). I did the following:

Created a table:

CREATE TABLE regexp_test (my_email_string VARCHAR(255));

Populated it with sample data (as per the question, the only place a +sign can appear in the email address is at the end of the first (local) part with 1 or more digits between it and the @ sign:

INSERT INTO regexp_test 
VALUES
('as.dfWEWd.fs+4@stuff.com'), 
('aS.Sdf.34343a.sfs@yahoo.com'),
('adsFFdsf"£"££$£$+15@tester.cn'),   <-- non-standard characters!
('test_dots.asdf.+12345@blah.au.com')  

-- multiple dots after the @ sign
-- this is the tricky part!

And then ran this query:

SELECT 
  LOWER
  (
    REGEXP_REPLACE
    (
      REPLACE
      (
        SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
        , '.', ''
      ) 
        || 
        '@' 
        || 
        SUBSTRING(my_email_string, '[^@]*$'), 
      '[^a-zA-Z0-9@.]', '', 'g'
    )
  ) AS "New email" 
FROM regexp_test;

Explanation of query below.

Result:

               New Email
    asdfwewdfs@stuff.com
assdf34343asfs@yahoo.com
     adsffdsf@tester.cn
testdotsasdf@blah.au.com

All this is available on a fiddle here.

TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question) OR in the second (domain) part - this part can legitimately have multiple dots - say blah.au.com. Any symbols which are not alphanumeric will be removed from the local part. Anything between a + and the @ sign in the local part will also be removed.

In particular, emails like john.m.doe+15@blah.au.com will be converted to johnmdoe@blah.au.com.

To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).

The first inner-most part uses SPLIT_PART() twice:

SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)

Which gets (chomps) the email string up to the first @ sign (and only @ signe in a valid email address). Then the second SPLIT_STRING chomps the string still further and removes anything from a + sign to the @ symbol inclusive.

Then the outer

REPLACE
(
  SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1) 
  , '.', ''
)

removes (replaces '.' with '' - nothing) all the dots in the first (local) part of the string as per the question, but not in the second (domain) part (addresses such as @blah.com.au are valid).

Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:

SUBSTRING(my_email_string, '[^@]*$'),

this gives us the domain from after the @ sign to the end of the string.

I then use the SQL string concatenation operator (|| - double pipe) to join the two strings and also reinsert the removed @ sign.

So, my data now looks like this (result_1):

asdfWEWdfs@stuff.com
aSSdf34343asfs@yahoo.com
adsFFdsf"£"££$£$@tester.cn
test_dotsasdf@blah.au.com

So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the + signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.

So, the query now looks like this (result_1 substituted in for clarity).

REGEXP_REPLACE(result_1, '[^a-zA-Z0-9@.]', '', 'g')

This replaces all ('g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing ('').

In fact, this part might be redundant. If there are no symbols other than .s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!

The LOWER() just tidies it up - I prefer emails in lower case!

Et voilà!

Alternative with COLLATE "C"

Instead of creating two indexes, there is another alternative for xxx_pattern_ops indexes that may be preferable. The documentation:

The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.

And:

The index automatically uses the collation of the underlying column.

You can create the column without collation (using COLLATE "C"). Then the default operator class behaves the same way as text_pattern_ops would - plus the index can be used with all the standard operators.

CREATE TABLE book2 (
   book_id serial PRIMARY KEY,
   book    text NOT NULL COLLATE "C" UNIQUE  -- that's all!
);

Now, LIKE can use the index:

SELECT * FROM book2 WHERE book LIKE 'foo%';

But ILIKE still can't:

SELECT * FROM book2 WHERE book ILIKE 'foo%';

SQL Fiddle.

Consider a trigram index using the additional module pg_trgm for a more general solution.:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Postgresql – Unique across tables

To enforce unique email addresses, I would remove all competing email columns and store them in one central email table for all active emails. And another table for deleted emails:

CREATE TABLE users (
  user_id  serial PRIMARY KEY
, username text UNIQUE NOT NULL
, email    text UNIQUE -- FK added below  -- can also be NOT NULL
);

CREATE TABLE email (
  email    text PRIMARY KEY
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
, UNIQUE (user_id, email)  -- seems redundant, but required for FK
);

ALTER TABLE users ADD CONSTRAINT users_primary_email_fkey
FOREIGN KEY (user_id, email) REFERENCES email (user_id, email);

CREATE TABLE email_deleted (
  email_id serial PRIMARY KEY
, email    text NOT NULL  -- not necessarily unique
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
);

This way:

Active emails are unique, enforced by the PK constraint of email.
Each user can have any number of active and deleted emails, but ...
Each user can only have one primary email.
Every email is always owned by one user and is deleted with the user.
To soft-delete an email (without losing it and its affiliation to its user, move the row from email to email_deleted.
- The primary email of a user cannot be deleted this way, because the primary email must not be deleted.
I designed the FK constraint users_primary_email_fkey to span (user_id, email), which seems redundant at first. But this way the primary email can only be an email that is actually owned by the same user.
Due to the default MATCH SIMPLE behavior of FK constraints, you can still enter a user without primary email, because the FK constraint is not enforced if any of the columns is null.
Details:
- Two-column foreign key constraint only when third column is NOT NULL

The UNIQUE constraint on users.email is redundant for this solution, but it may be useful for other reasons. The automatically created index should come in handy (for instance for the last query in this answer).

The only thing that's not enforced this way is that every user has a primary email. You can do this, too. Add NOT NULL constraint to users.email

UNIQUE (user_id, email) is required for the FK constraint:

How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

You have doubtless spotted the circular reference in the above model. Contrary to what one might expect, this just works.

As long as users.email can be NULL, it's trivial:

INSERT user without email.
INSERT email referencing the owning user_id.
UPDATE user to set it's primary email if applicable.

It even works with users.email set to NOT NULL. You have to insert user and email at the same time though:

WITH u AS (
   INSERT INTO users(username, email)
   VALUES ('user_foo', 'foo@mail.com')
   RETURNING email, user_id
   )
INSERT INTO email (email, user_id)
SELECT email, user_id
FROM   u;

IMMEDIATE FK constraints (the default) are checked at the end of each statement. The above is one statement. That's why it works where two separate statements would fail. Detailed explanation:

How to deal with mutually recursive inserts

To get all emails of a user as array, with the primary email first:

SELECT u.*, e.emails
FROM   users u
     , LATERAL (
      SELECT ARRAY (
      SELECT email
      FROM   email
      WHERE  user_id = u.user_id
      ORDER  BY (email <> u.email)  -- sort primary email first
      ) AS emails
   ) e
WHERE  user_id = 1;

You could create a VIEW with this for ease of use.
LATERAL requires Postgres 9.3. use a correlated subquery in pg 9.2:

SELECT *, ARRAY (
             SELECT email
             FROM   email
             WHERE  user_id = u.user_id
             ORDER  BY (email <> u.email)  -- sort primary email first
             ) AS emails
FROM   users u
WHERE  user_id = 1;

To soft-delete an email:

WITH del AS (
   DELETE FROM email
   WHERE  email = 'spam@mail.com'
   RETURNING email, user_id
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM del;

To soft-delete the primary email of a given user:

WITH upd AS (
   UPDATE users u
   SET    email = NULL
   FROM   (SELECT user_id, email FROM users WHERE user_id = 123 FOR UPDATE) old
   WHERE  old.user_id = u.user_id
   AND    u.user_id = 1
   RETURNING old.*
   )
,    del AS (
   DELETE FROM email
   USING  upd
   WHERE  email.email = upd.email
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM upd;

Details:

Return pre-UPDATE Column Values Using SQL Only - PostgreSQL Version

Quick test for all of the above: SQL Fiddle.

Best Answer

Related Solutions

PostgreSQL – Unique Index vs Unique Constraint with Operator Class

Alternative with COLLATE "C"

Postgresql – Unique across tables

Related Question