PostgreSQL – Using tsvectors with gin_trgm_ops for Full-Text Search

full-text-searchindexpostgresql

I have a table like so.

ALTER TABLE clients ADD tsv tsvector;
CREATE INDEX index_clients_on_tsv
  ON clients
  USING gin (tsv);
CREATE TRIGGER clientstsvectorupdate BEFORE INSERT OR UPDATE
  ON clients FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(
    tsv, 'pg_catalog.english', first_name, middle_name, last_name, email, birthday_string
  );

But then I noticed that fuzzy searching was not working: "Chris", would not look up "Christopher". I did a little digging and found gin_trgm_ops Naturally the first thing I tried to do was update the index via

CREATE INDEX index_clients_on_tsv
  ON clients
  USING gin (tsv gin_trgm_ops);

But now I see an error relating to the fact I'm passing a tsvector to gin_trgm_ops.

(Postgrex.Error) ERROR 42804 (datatype_mismatch): operator class "gin_trgm_ops" does not accept data type tsvector

What is the best way to address what I'm trying to implement?

Best Answer

The gin_trgm_ops is an operator class over text. Your tsv column is a tsvector, and not text. So simply that won't work. A tsvector stores lexemes not words. That means the data you're searching for may not even be there. Observe,

SELECT to_tsvector('Christopher');
  to_tsvector  
---------------
 'christoph':1
(1 row)

That means after you convert Christopher to a tsvector, you lose some data in the algorithmic affix stemmer. That's fine for FTS, it's part of the compromise and how you achieve the amazing speed. It may not be desirable if that's what you're looking for though, so you have two options.

Use lexeme prefix matching with :* (assuming the lexeme has the needed information, if it doesn't you may be able to hack either the text search configuration to not-stem, or the stemmer to stem to your workload)
```
## returns false
SELECT to_tsvector('Christopher') @@ to_tsquery('Chris');
## returns true
SELECT to_tsvector('Christopher') @@ to_tsquery('Chris:*');
```
Index the actual column you're using to generate the TSV or wish to search in, using the trigram operator classes (like gin_trgm_ops). If you're always doing left-anchored search, you may consider the more efficient text_pattern_ops.

Related Solutions

Postgresql – slow queries – set weight to token type in tsquery – postgresql

Your ideal solution would probably be to write a custom, domain-data specific tsearch parser for your data, so it produced an ideal tsvector and tsquery for the input.

Alternatively, your search problem may be better suited to external tools like Apache Solr, which can work quite happily in conjunction with Pg.

When it comes to the "NoSQL" stuff (which NoSQL? Graph/tree? Key/value store? Object? Document? ...) it all tends to come down to details of your data model and how willing you are to write your own distributed search algorithms. You can often gain significant scale-out performance at the cost of considerably greater implementation effort.

MySQL – Fulltext Search Using Multiple Indexes

Unfortunately, this is how MySQL Query Optimizer treats FULLTEXT indexes. When a MATCH clause is the only clause in the WHERE, the index will be used. When used in conjunction with AND, the index may easily get overlooked.

I wrote about this behavior before in Mysql fulltext search my.cnf optimization

SUGGESTION : Rewrite the query as the union of two FULLTEXT searches

SELECT * FROM content 
WHERE MATCH(a, b, c) AGAINST ('keyword')
UNION
SELECT * FROM content 
WHERE MATCH(d) AGAINST ('keyword');

GIVE IT A TRY !!!

Best Answer

Related Solutions

Postgresql – slow queries – set weight to token type in tsquery – postgresql

MySQL – Fulltext Search Using Multiple Indexes

Related Question