Postgresql – How to use character classes in a Postgres regex select

postgresqlregexregular expression

I'm trying to create a query that selects strings that contain the number 21, without hitting strings such as 210 or 121. My efforts below have shown my progress, in that I've managed to apply a constraint for the beginning of the word, but I'm stumped in terms of providing delimiter at the end.
I would like to match 21 and 21X where X is a nondigit character.
Normally I'd use [] match both situations where there is a nondigit character or the end of the string, but I'm stymied on this.

I can filter based on the start of the word and character classes work:

# SELECT "buildings"."street_name" FROM "buildings"  WHERE (street_name ~ '\m21[s0]');
 street_name
-------------
 21st
 210th

And here I can successfully use the nondigit (\D) constraint:

 # SELECT "buildings"."street_name" FROM "buildings"  WHERE (street_name ~ '\m21\D');
 street_name
-------------
21st

But when I try to use the nondigit constraint to define a class i.e. \m21[\D] (in an effort to match the string '21') it will fail.

How can I define a regex that matches my target (in the case 21) with a non-digit suffix or string terminator ?

Best Answer

You can invert a character class with ^ as first character:

street_name ~ '\m21[^\d]'

Which is exactly the same as:

street_name ~ '\m21\D'

Or the long form:

street_name ~ '\m21[^[:digit:]]

Neither matches at the end of the string. I suggest two branches:

street_name ~ '\m21(\D|$)'

... which implements your requirements literally:

with a non-digit suffix or string terminator

Proper test tool

To check your requirements:

SELECT *
     , street_name ~ '\m21(\D|$)' AS erwin
     , street_name ~ '\m21\D*\M'  AS ruy  -- just without parentheses
FROM  (
   VALUES
   (1, '21'    , 'true'::bool)
 , (2, '21X'   , 't')
 , (3, '21 foo', 't')
 , (4, '21X11' , 't') -- !
 , (5, '121'   , 'f')
 , (6, '210'   , 'f')
 , (7, 'X21'   , 'f')
   ) t(id, street_name, should_match);

What about '21X11': digits *after a non-digit suffix?
My regexp accepts it, @Ruy's regexp rejects it.
As I read your question it should pass?


The (already inverted) class shorthand \D is not allowed inside another character classes. The manual:

Within bracket expressions, \d, \s, and \w lose their outer brackets, and \D, \S, and \W are illegal.

Bold emphasis mine.

You did not rule out 'X21' explicitly, but I assume you want to start with a word boundary like in your example: \m.