Postgresql – Phone number with different formats lookup

postgresql

I have different phone formats stored in my db, such as :

727.170.4799
1-416-958-6390
1.561.374.4268
(813) 929-5892

There are approximately ~2 million records in this table so modifying would take a long time. And I need to find records by performing phone lookup.

One thing I control is the input, I can guarantee that it's going to be numeric. But what I don't know is whether they're going to type in numeric number with or without the country code so taking this number as an example 1-416-958-6390 both of these used in lookup:

14169586390
4169586390

Should return the same row (1-416-958-6390)

How do I do phone for phone numbers as efficiently as I can?

This is what I got but it's super slow (as expected) :

select * from patients
WHERE ( REPLACE(phone_number, '-', '') LIKE '%4169586390' )

Best Answer

To clean up the numbers use a regex that removes everything that's not a numeric digit:

regexp_replace(phone_number, '[^0-9]', '', 'g') 

You can create an index on that, but a regular B-Tree index won't be used LIKE conditions where the wildcard is on the left.

In Postgres you have two choices: you can create a regular B-Tree index on the reverse value and use that instead:

create index on patients ( reverse(regexp_replace(phone_number, '[^0-9]', '', 'g')) );

Then use that expression and compare it with the reversed input:

select * 
from patients
where reverse(regexp_replace(phone_number, '[^0-9]', '', 'g')) like reverse('4169586390')||'%'

The other choice is to install the pg_trgm extension and create a trigram index on the expression. That way you can use the input "as is":

create index on patients 
   using gist ( regexp_replace(phone_number, '[^0-9]', '', 'g') gist_trgm_ops);

That index is then usable for:

select * 
from patients
where regexp_replace(phone_number, '[^0-9]', '', 'g') like '%4169586390'

However, GiST indexes are bigger and a bit slower to maintain than B-Tree indexes.


If you don't want to repeat that regex expression in every query you can create a view that includes that, e.g.:

create view patients_clean_phones
as
select ... other columns ..., regexp_replace(phone_number, '[^0-9]', '', 'g') as clean_phone
from patients;

Then use (assuming you are using the GiST index, so no reverse required):

select *
from patients_clean_phones
where clean_phone like '4169586390%';

Postgres is smart enough to still use the index in that case.


If you are on Postgres 12, an alternative to the view is to use a computed column with the regex expression, then create the index on the computed column.