Using REGEX_LIKE to Implement a Check Constraint

constraintdatabase-designregexregular expression

Background

I am working on the database design for an application that tracks information about database development projects at my company. The project management office (PMO) assigns a psued-number to each project based on if the project's cost is counted toward merger integration or not. To illustrate, the number assigned will either start with INT (for a project contributing to merger integration) or NINT (for a project not contributing to merger integration). Then a sequential number is assigned. For example, INT175 would be 175'th project identified contributing to merger integration. An additional wrinkle is the PMO sometimes wants to create multiple projects with the same number as they are "related". So for example there could be an INT175a project, an INT175b project, and so on.

While the PMO considers this number to be the identifier for a project, in the database I will only use the column as an alternate unique key both to ensure duplicate projects are not created and as a search key for users. I want to implement a check constraint on that column to ensure only a valid number matching the rules I gave above can be entered.

What I've Tried – Regular Expressions

My first thought was to use a regular expression. In Oracle, my target DBMS, we have the REGEX_LIKE function that can be used. I researched the documentation (here, here, here, here, and here ) on regular expressions and found the options to be dizzying! For whatever reason my brain gets tangled up trying to take the list of all the various pattern matching options and apply it to my specific example. I'm one of those people who work best with examples. Unfortunately, I haven't found many examples out there. What I have found have either been too simple, or so complex I couldn't grasp them.

What I have tried thus far is something like this:

select c
from (
  select 'INT1756b' c from dual union all
  select 'INT175a' c from dual union all
  select 'INT75a' c from dual union all
  select 'INT75' c from dual union all
  select 'NINT2283a' c from dual union all
  select 'NINT2283' c from dual union all
  select 'NINT915c' c from dual union all
  select 'NINT915' c from dual union all
  select 'NINT95b' c from dual union all
  select 'NINT95' c from dual union all
  select 'ABC123' c from dual
) x
where REGEXP_LIKE ( c, '(NINT|INT)\d{2}[0-9]|[a-z]' );

My test "table" x has what I foresee as an example of all the possible combinations of valid numbers plus one invalid number. They will all either start with INT or NINT, and then have at least 2 numbers up to 4 numbers, and then optionally a lower case letter on the end.

My Struggle – Handling Varying Lengths

What I am really struggling with is how to deal with the varying lengths of the numbers. There can be 2, or 3, or 4 numbers, which may or may not then be followed by a lower case letter. What I came up with above, just as a starter, was to look for 2 numbers after the INT or NINT. I know there will always be at least two numbers. Then I just look to see if the rest are in the set of digits or lower case letters. Now I know this isn't good enough but it is all I could come up with after a day or two of racking my brain on this.

Plea for Help!

Has anyone done something similar with regular expressions who could point me in the right direction or give me an example? Is there a more elegant and simple way to implement this constraint? I appreciate any and all help and pointers!

Best Answer

This isn't really a database administration question, but...

Firstly, your pattern does not prevent entries like this: XXXINT123abogus, because it does not say that "INT" or "NINT" should be at the beginning of the string, neither does it say that the lowercase letter, if it's there, should be the last symbol of the string.

You may want to try something like:

 +----------------------------- begins with
 |+---------------------------- optional "N"
 ||  +------------------------- then "INT"
 ||  |  +---------------------- then digits
 ||  |  | +-------------------- minimum 2
 ||  |  | | +------------------ maximum 4
 ||  |  | | |   +-------------- then a lowercase letter
 ||  |  | | |   |  +----------- which is optional
 ||  |  | | |   |  |+---------- and nothing more
 ||  |  | | |   |  ||
 ||  |  | | |   |  ||
'^N?INT\d{2,4}[a-z]?$'