Postgresql – Postgres: Dictionary to remove leading zeroes from alpha-numerical words

full-text-searchpostgresql

I am running a full text search in Postgres 10 over texts which include a variety of alphanumerical IDs (as well as 'real' text).
My clients want to be able to find those IDs without the need to enter the leading zeroes. As the number of digits vary for different IDs I do not want to go down the path of adding extra zero-padded search queries.

I was wondering if there is a way to remove leading zeroes using a dictionary. But this would require some wildcards to catch an alphanumeric expression with leading zeroes. My understanding so far is that this is not supported by the ispell/hunspell dictionaries supported by Postgres 10.

I am currently thinking about using regexes like /\W[0]+(:alnum:)+/ and adding the matches to the text before feeding the text to the search index. This way I would catch both the zero-padded original and the variant without leading zeroes. But it does feel a little clumsy…

Does anybody have a better idea on how to do this using a full text search dictionary (or something similar)?

The texts I am search look like this:

'Max Mustermann 00023131021 Wunderstr. 17 Trallerhausen 002ed2323'

Example – they are typically much longer in reality and contain up to 50 different IDs.
Here, I would like to hit on '00023131021', '23131021', '2ed2323 & Wunder:*', etc.

Best Answer

Postgres text search only supports prefix matching, not postfix matching like you could use:

Depending on exact requirements a number of workarounds come to mind.

If all leading zeros are insignificant, you could remove those from the indexed column in an expression index like:

CREATE INDEX foo_idx ON tbl
USING GIN (to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g')));

Or use a regular expression fitting your actual requirements:

dbfiddle here

Then match the index in queries like:

SELECT *
FROM   tbl
WHERE  to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g'))
    @@ to_tsquery ('simple', '2ed2323 & Wunder:*');

Remove leading zeroes from the tsquery value as well - if those can be contained:

   ...
    @@ to_tsquery( 'simple', regexp_replace(my_querystring, '\m0+', '', 'g'));

Related: