Oracle index with REGEXP_REPLACE is it a good idea

indexoracleregex

I have a table of contacts with 1 million registers and now the user wants to search in it by the telephone.

Trying to avoid a fullscan I decided to create a new index in this column, but the trick part is that the data in this column is not normalized. I can find telephones separate by spaces, by "-", and even a lot of different formats. (ex: (418) 123-1234, 418 123-1234, 4181231234, 55(11)12321-1234, …)

Found some articles saying to use REGEXP_REPLACE to take care of that… just removing everything and comparing always with numbers. So, my query for test would by like that:

select * from tb_contc WHERE REGEXP_REPLACE (TEL_CONTC,'[^[:digit:]]+','') = '1234567777';

First problem I've found is that that is not a valid index to create. Trying to create it returns me a " ORA-01450: maximum key length":

CREATE INDEX IX_06_TB_CONTC ON TB_CONTC REGEXP_REPLACE (TEL_CONTC,'[^[:digit:]]+','');

Apparently this kind of function in the index would create a index that is worst than making a full scan in the table… The next step was to add a SUBSTRING in the index, that allows to create it by limiting the size of the VARCHAR returned from my REGEX

CREATE INDEX IX_06_GR_CONTC ON GR_CONTC (substr( REGEXP_REPLACE (TEL_CONTC,'[^[:digit:]]+',''), 1, 20));

Things got real ugly… Besides, to force the use of this index I need to add the exactly same query in my select.

Wouldn't be bad if it it had a good performance, but instead what I got is:

Without the index: 15 seconds the first time I execute the query, all the others would take basically nothing
With the index: 2 seconds every time.

Wraping up: Am I doing a good thing by using this kind of index with REGEX? Is there a better way?

Best Answer

I don't have direct experience of an index with this kind of REGEX so I would avoid it by creating a 'standardised' telephone number column.

Create a process to go through the telephone numbers and store the clean, formatted version in the new column.

Perform the 'standardising' process on the user input and use that as a search against the new telephone column - which you will have indexed.

The other option is to go through a data clean up process to standardise all the telephone numbers and ensure that any on-going input method applies the same process/validation.

EDIT:

The virtual column option mentioned above may be of use to you as well.

ALTER TABLE contacts ADD(search_telephone AS (REGEXP_REPLACE(tel_no, 'pattern', 'replace')).