PostgreSQL – Best Way to Store an Email Address

database-designdatatypespostgresql

What would be the right datatype to store email addresses in PostgreSQL?

I can use varchar (or even text), but I wonder if there is a more specific data type for emails.

Best Answer

Custom DOMAINs

I don't think using citext (case-insensitive) is enough[1]. Using PostgreSQL we can create a custom domain which is essentially some defined constraints over a type. We can create a domain for instance over the citext type, or over text.

Using HTML5 type=email spec

Currently the most correct answer to the question what is an e-mail address is specified in RFC5322. That spec is insanely complex[2], so much so that everything breaks it. HTML5 contains a different spec for email,

This requirement is a willful violation of RFC 5322, which defines a syntax for e-mail addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here. [...] The following JavaScript- and Perl-compatible regular expression is an implementation of the above definition.

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

This is likely what you want, and if it's good enough for HTML5, it's probably good enough for you. We can make use of that directly in PostgreSQL. I also use citext here (which technically means you can simply the regex a bit visually by removing either the upper-case or lower-case).

CREATE EXTENSION citext;
CREATE DOMAIN email AS citext
  CHECK ( value ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$' );

Now you can do...

SELECT 'asdf@foobar.com'::email;

But not

SELECT 'asdf@foob,,ar.com'::email;
SELECT 'asd@f@foobar.com'::email;

Because both of those return

ERROR:  value for domain email violates check constraint "email_check"

Because this is based on citext too

SELECT 'asdf@foobar.com'::email = 'ASdf@fooBAR.com';

returns true by default.

Using plperlu/Email::Valid

As an important note, there is a more correct method of doing this that is far more complex using plperlu. If you need this level of correctness you do not want citext. Email::Valid can even check if the domain has an MX record (example in docs of Email::Valid)! First, add plperlu (requires superuser).

CREATE EXTENSION plperlu;

Then create the function, notice we mark at as an IMMUTABLE:

CREATE FUNCTION valid_email(text)
  RETURNS boolean
  LANGUAGE plperlu
  IMMUTABLE LEAKPROOF STRICT AS
$$
  use Email::Valid;
  my $email = shift;
  Email::Valid->address($email) or die "Invalid email address: $email\n";
  return 'true';
$$;

Then create the domain,

CREATE DOMAIN validemail AS text NOT NULL
  CONSTRAINT validemail_check CHECK (valid_email(VALUE));

Footnotes

  1. Using citext is technically wrong. SMTP defines local-part as being case sensitive. But, again, this is a case of the spec being stupid. It contains its own identity crises. The spec says local-part (the part before the @) "MAY be case-sensitive" ... "MUST BE treated as case sensitive" ... and yet "exploiting the case sensitivity of mailbox local-parts impedes interoperability and is discouraged."
  2. The spec for an email address is so complex, it's not even self-contained. Complex is truly an understatement, those making the spec don't even understand it.. From the docs on regular-expression.info

    Neither of these regexes enforce length limits on the overall email address or the local part or the domain names. RFC 5322 does not specify any length limitations. Those stem from limitations in other protocols like the SMTP protocol for actually sending email. RFC 1035 does state that domains must be 63 characters or less, but does not include that in its syntax specification. The reason is that a true regular language cannot enforce a length limit and disallow consecutive hyphens at the same time.