PostgreSQL Unique Constraint – Remove Extra Email Characters to Enforce Uniqueness

constraintpostgresqlunique-constraint

  • Gmail ignores . inside an email address e.g abc@gmail.com and a.b.c@gmail.com are same
  • Gmail also considers same account if account has any number after + symbol. e.g xyz@gmail and xyz+1@gmail.com are same

I want to put constraint for uniqueness on emails which can handle above case in postgres but not successful so far. Simple unique and saving in lower case doesn't work

Best Answer

To answer your question (+1 for an interesting, challenging and relevant question). I did the following:

Created a table:

CREATE TABLE regexp_test (my_email_string VARCHAR(255));

Populated it with sample data (as per the question, the only place a +sign can appear in the email address is at the end of the first (local) part with 1 or more digits between it and the @ sign:

INSERT INTO regexp_test 
VALUES
('as.dfWEWd.fs+4@stuff.com'), 
('aS.Sdf.34343a.sfs@yahoo.com'),
('adsFFdsf"£"££$£$+15@tester.cn'),   <-- non-standard characters!
('test_dots.asdf.+12345@blah.au.com')  

-- multiple dots after the @ sign
-- this is the tricky part!  

And then ran this query:

SELECT 
  LOWER
  (
    REGEXP_REPLACE
    (
      REPLACE
      (
        SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
        , '.', ''
      ) 
        || 
        '@' 
        || 
        SUBSTRING(my_email_string, '[^@]*$'), 
      '[^a-zA-Z0-9@.]', '', 'g'
    )
  ) AS "New email" 
FROM regexp_test;

Explanation of query below.

Result:

               New Email
    asdfwewdfs@stuff.com
assdf34343asfs@yahoo.com
     adsffdsf@tester.cn
testdotsasdf@blah.au.com

All this is available on a fiddle here.

TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question) OR in the second (domain) part - this part can legitimately have multiple dots - say blah.au.com. Any symbols which are not alphanumeric will be removed from the local part. Anything between a + and the @ sign in the local part will also be removed.

In particular, emails like john.m.doe+15@blah.au.com will be converted to johnmdoe@blah.au.com.

To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).

The first inner-most part uses SPLIT_PART() twice:

SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1) 

Which gets (chomps) the email string up to the first @ sign (and only @ signe in a valid email address). Then the second SPLIT_STRING chomps the string still further and removes anything from a + sign to the @ symbol inclusive.

Then the outer

REPLACE
(
  SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1) 
  , '.', ''
) 

removes (replaces '.' with '' - nothing) all the dots in the first (local) part of the string as per the question, but not in the second (domain) part (addresses such as @blah.com.au are valid).

Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:

SUBSTRING(my_email_string, '[^@]*$'), 

this gives us the domain from after the @ sign to the end of the string.

I then use the SQL string concatenation operator (|| - double pipe) to join the two strings and also reinsert the removed @ sign.

So, my data now looks like this (result_1):

asdfWEWdfs@stuff.com
aSSdf34343asfs@yahoo.com
adsFFdsf"£"££$£$@tester.cn
test_dotsasdf@blah.au.com

So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the + signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.

So, the query now looks like this (result_1 substituted in for clarity).

REGEXP_REPLACE(result_1, '[^a-zA-Z0-9@.]', '', 'g')

This replaces all ('g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing ('').

In fact, this part might be redundant. If there are no symbols other than .s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!

The LOWER() just tidies it up - I prefer emails in lower case!

Et voilà!