PostgreSQL text search email parser does not handle + correctly

full-text-search, postgresql

The default text search parser does not recognize email addresses that contain a plus sign (tested on PostgreSQL 11.5).

For test@example.com it works fine:

select ts_debug('test@example.com');
+-----------------------------------------------------------------------------+
| ts_debug                                                                    |
|-----------------------------------------------------------------------------|
| (email,"Email address",test@example.com,{simple},simple,{test@example.com}) |
+-----------------------------------------------------------------------------+

But for test+fragment@example.com it treats + as a token separator:

select ts_debug('test+fragment@example.com');
+-------------------------------------------------------------------------------------+
| ts_debug                                                                            |
|-------------------------------------------------------------------------------------|
| (asciiword,"Word, all ASCII",test,{english_stem},english_stem,{test})               |
| (blank,"Space symbols",+,{},,)                                                      |
| (email,"Email address",fragment@example.com,{simple},simple,{fragment@example.com}) |
+-------------------------------------------------------------------------------------+

Is there an easy way to fix the email parser? I know that it is possible to implement a custom parser, but that sounds like overkill for this issue. Also, I can't find a relevant example of such a parser.

Best Answer

I don't have good news for you. Even if you could convince the PostgreSQL hackers that such a change is desirable, it probably still would not be made in the default parser, because a run of pg_upgrade would then invalidate all existing FTS indexes, and they would have to be rebuilt. Even if the change in behavior were acceptable to impose between major versions, the index invalidation probably would not be.

You could fork the default parser to implement your own, but that is unreasonably difficult. While the CREATE TEXT SEARCH PARSER doc does list the names of the functions you need to implement, it gives no hint of what those functions need to do, or even what their call signatures are. It just refers you elsewhere, which in turn describes the behavior of the default parser, but not in a way that helps you implement your own. Finally, "contrib" offers no examples from which you could draw inspiration, and I am not aware of reasonable examples anywhere else, either. There is some very old documentation, but it isn't clear how relevant it still is.

If you come up with a more satisfying solution, please do post your own answer about it.

Is your goal to have '+' behave the same as it currently does in every case except where it appears in an email? Or would you want '+' to be treated in general like a word character, not just in emails? Is '+' the only thing that is giving you trouble? Some of the other allegedly-valid examples on wiki also are not processed correctly, but those may be of no consequence to you.

Before starting to implement my own parser, I would test whether something acceptable can be done at an earlier level, for example by replacing '+' with '.' in both the documents and the queries.
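
A minimal sketch of that replacement idea, assuming a helper function and a table that I made up for illustration (normalize_plus and docs are not part of PostgreSQL), and assuming the 'simple' configuration is acceptable for your data. The point is only that the same rewrite has to be applied on both the indexing side and the query side; check the ts_debug output yourself before relying on it.

```sql
-- Hypothetical helper (not part of PostgreSQL): rewrite '+' to '.' so that
-- the default parser no longer sees '+' as a token separator.
CREATE FUNCTION normalize_plus(txt text) RETURNS text
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT replace(txt, '+', '.') $$;

-- Inspect how the parser tokenizes the rewritten address.
SELECT * FROM ts_debug('simple', normalize_plus('test+fragment@example.com'));

-- Hypothetical table, used only for illustration.
CREATE TABLE docs (id serial PRIMARY KEY, body text);

-- Index the normalized text ...
CREATE INDEX docs_fts_idx ON docs
    USING gin (to_tsvector('simple', normalize_plus(body)));

-- ... and normalize the search string the same way before querying.
SELECT id
FROM docs
WHERE to_tsvector('simple', normalize_plus(body))
      @@ plainto_tsquery('simple', normalize_plus('test+fragment@example.com'));
```

Two caveats with this approach: test+fragment@example.com and test.fragment@example.com become indistinguishable, and every other '+' in the text (for example in "C++") is rewritten as well, so it only fits if neither matters for your searches.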