Postgresql – Convert Ruby regex to Postgres regex, for selecting invalid email addresses

perlpostgresqlregex

I've done some internet trawling, and some reading of the Postgres regex docs, but I'm at the stage where I think asking a question here is the best thing to do.

I have the following Ruby regex:

/\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/

Which I am trying to convert into a Postgres regex.
So far I have come up with:

'\\A[^@\\s]\+@([^@\\s]+\\.)\+[^@\\W]\+\\z'

As I found this in one article online:

Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character.

and this in another:

The backslash is used to escape characters in PostgreSQL strings. So a regular expression like \w that contains a backslash becomes \\w when written as a literal string in a PostgreSQL statement. To match a single literal backslash, you’ll need the regex \\ which becomes \\\\ in PostgreSQL.

The total query I'm running is this:

SELECT email
FROM users
WHERE email !~ '\\A[^@\\s]\+@([^@\\s]+\\.)\+[^@\\W]\+\\z'

And I'm expecting it to return zero results, bit it looks like it is selecting all email addresses.

I'm expecting no results because I've used the operator !~ (I've also tried !~*), which I think returns non-matches, and because I'm pretty sure that all the email addresses in my database do match the Ruby regex I pasted at the top of this question. (This in turn is because we use Devise in a Rails app to validate all user emails, with the default email validation settings)

So I've come to the conclusion that I've not converted my regex format/syntax correctly for Postgres, but also that I don't know what else I'm missing. Please can you help me understand whether this is possible, and if so, what else I need to change.


Once I have successfully translated that regex, I have a second Ruby regex which I also want to translate into Postgres, which is this one:

/\A[a-zA-Z0-9.!\#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*\z/

This is because I want to find all the user emails that don't have a valid email address according to the spec here: https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address

Thanks so much in advance ?

Best Answer

You should debug regexp by first just selecting the raw text make sure it is perceived at the textual level the way you want.

select '\\A[^@\\s]\+@([^@\\s]+\\.)\+[^@\\W]\+\\z';

                 ?column?                 
------------------------------------------   
\\A[^@\\s]\+@([^@\\s]+\\.)\+[^@\\W]\+\\z

So the doubled backslashes did not get condensed to single, the way you probably thought they would. Note that this is with the default (on) setting of standard_conforming_strings setting.

So regexp parser is going to see a backslashed backslash, which will match a literal backslash character, followed by a literal 'A', which will match a literal 'A'. Which is obviously not what you want.

You can remove the double backslashes. Then you have to worry about whether all the notations you use are supported in PostgreSQL. \W is illegal in PostgreSQL within brackets, so at least that one is not supported.