PostgreSQL: a strange tsquery behavior

full-text-search postgresql

select  to_tsvector('english', 'Ice-cream') @@ to_tsquery('english', 'Ice<->cream');

returns true, while:

select  to_tsvector('english', 'iDream  Ice-cream  iScream') @@ to_tsquery('english', 'iDream<->Ice<->cream<->iScream');

returns false.

As I understand it, the second query only adds the same word before and after the matching part on both sides, so I expected the answer to stay the same.

Best Answer

That's because full-text search treats hyphenated words specially. First, look at the text without the hyphen:

SELECT to_tsvector('english', 'iDream  Ice cream  iScream');

               to_tsvector                
------------------------------------------
 'cream':3 'ice':2 'idream':1 'iscream':4
(1 row)

The numbers behind the lexemes mark the position they had in the original text (cream is the third word, and so on). That is used for phrase search.

SELECT to_tsvector('english', 'iDream  Ice-cream  iScream');

                      to_tsvector                       
--------------------------------------------------------
 'cream':4 'ice':3 'ice-cream':2 'idream':1 'iscream':5
(1 row)

You see that the original hyphenated word is at the second position, and the parts are represented as following the hyphenated word.

So ice cream is not the same as ice-cream for PostgreSQL full text search. In the first case, ice immediately follows idream, but not in the second case. That is why your query returns FALSE.
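To make the position arithmetic concrete, here is a small sketch that keeps the default english configuration and instead widens the gap in the query. It relies on the `<N>` phrase operator, which requires the right operand to sit exactly N positions after the left one (`<->` is `<1>`). Given the positions shown above, the first query should return true. Treat it as an illustration of the explanation, not as the recommended fix.

```sql
-- Positions in the hyphenated text:
--   idream:1, ice-cream:2, ice:3, cream:4, iscream:5
-- idream -> ice is a distance of 2, the remaining steps are a distance of 1.
SELECT to_tsvector('english', 'iDream  Ice-cream  iScream')
       @@ to_tsquery('english', 'iDream <2> Ice <-> cream <-> iScream');

-- The single-word case from the question: the parts again follow the
-- whole hyphenated word, so ice and cream stay adjacent and
-- Ice<->cream still matches.
SELECT to_tsvector('english', 'Ice-cream');
```

Hard-coding `<2>` only works when you know the word is hyphenated in the document, so it is fragile for general search.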

Look at what the parser does:

SELECT alias, token, lexemes FROM ts_debug('english', 'iDream  Ice-cream  iScream');

      alias      |   token   |   lexemes   
-----------------+-----------+-------------
 asciiword       | iDream    | {idream}
 blank           |           | 
 asciihword      | Ice-cream | {ice-cream}
 hword_asciipart | Ice       | {ice}
 blank           | -         | 
 hword_asciipart | cream     | {cream}
 blank           |           | 
 asciiword       | iScream   | {iscream}
(8 rows)

Perhaps the solution you are looking for would be to ignore hyphenated words and just keep their parts:

CREATE TEXT SEARCH CONFIGURATION en_no_hyphen
   (COPY = english);

ALTER TEXT SEARCH CONFIGURATION en_no_hyphen
   DROP MAPPING FOR asciihword, hword;

SELECT to_tsvector('en_no_hyphen', 'iDream  Ice-cream  iScream')
       @@ to_tsquery('en_no_hyphen', 'iDream<->Ice<->cream<->iScream');

 ?column? 
----------
 t
(1 row)
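To check what the new configuration does, you can inspect the vector and the parser output directly. This is a sketch that assumes the en_no_hyphen configuration has been created as above. Since the phrase query matches, the parts should occupy consecutive positions, with no entry for the whole hyphenated word.

```sql
-- Positions under the hyphen-free configuration.
SELECT to_tsvector('en_no_hyphen', 'iDream  Ice-cream  iScream');

-- Show which token types still have dictionaries mapped;
-- asciihword should now produce no lexemes.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('en_no_hyphen', 'iDream  Ice-cream  iScream');
```

Note that the DROP MAPPING above lists only asciihword and hword. Hyphenated words containing digits are a separate token type (numhword), which this configuration would still index as a whole.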

Workaround

You can, however, use functions in an index, and functions can reference external tables. That said, this is something of a hack, because changing the table (common.lang) will require reindexing and clearing the session cache.

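CREATE FUNCTION common.mylookup(id int)
RETURNS regconfig AS $$
  -- Use $1 for the argument: a bare "id" would resolve to the column
  -- lang.id, making "id = id" true for every row.
  SELECT name::regconfig
  FROM common.lang
  WHERE lang.id = $1
$$
LANGUAGE sql
IMMUTABLE;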

CREATE INDEX
  ON source.user
  USING GIN (to_tsvector(common.mylookup(profile_lang_id), name || ' ' || screen_name ));

Marking the function IMMUTABLE is what makes this permissible. If the underlying table changes, you'll have to REINDEX the index. Projects that use this hack, like PostGIS, recreate indexes on point releases.
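For the planner to consider an expression index like this one, the query has to repeat the indexed expression exactly. A minimal sketch follows. The search term 'carroll' and the fixed 'english' configuration on the query side are assumptions for illustration. The query side is kept constant because a tsquery that depends on row values cannot be used to probe the index.

```sql
-- The left side of @@ must match the indexed expression exactly.
-- 'carroll' and 'english' are placeholder values, not from the original.
SELECT name, screen_name
FROM source.user
WHERE to_tsvector(common.mylookup(profile_lang_id), name || ' ' || screen_name)
      @@ plainto_tsquery('english', 'carroll');

-- After changing rows in common.lang, rebuild the table's indexes so the
-- stored tsvectors reflect the new lookup results.
REINDEX TABLE source.user;
```

REINDEX TABLE is used here because the index above was created without an explicit name. If you name the index yourself, REINDEX INDEX on that name is narrower.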

Followup

@EvanCarroll, I'm trying to create the tsvector on name || ' ' || screen_name. – Brooks 4 mins ago

Full-text search isn't there to do what you think it does. It's not there to search multiple fields. It's there to vectorize word content and make use of dictionaries, stemming, lexers, gazetteers, stop-word elimination, and a slew of other tricks, none of which apply here. If this doesn't make sense to you, you'll have to read the docs. If what you want is grep, then FTS is only seldom what you want. If you want to grep over small chunks of non-standard text (like names), it's not what you want. What you likely want is trigram indexing.

If all you want is a %term% on two fields, you're better off just doing that with a trigram index.

CREATE EXTENSION pg_trgm;
CREATE INDEX ON source.user USING GIN ((name || ' ' || screen_name) gin_trgm_ops);

-- $1 is the search term, passed as a query parameter
SELECT * FROM source.user
WHERE (name || ' ' || screen_name) LIKE '%' || $1 || '%';

Or even better,

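CREATE INDEX ON source.user USING GIN (name gin_trgm_ops, screen_name gin_trgm_ops);

-- $1 is the search term, passed as a query parameter
SELECT * FROM source.user
WHERE name LIKE '%' || $1 || '%' OR screen_name LIKE '%' || $1 || '%';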
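The $1 in the conditions above stands for the search term. One way to bind it from plain SQL is a prepared statement. The sketch below assumes the two-column trigram index above and uses a hypothetical statement name, find_user. It uses ILIKE for case-insensitive matching, which gin_trgm_ops also supports.

```sql
-- find_user is a hypothetical name; $1 is the raw search term.
PREPARE find_user(text) AS
SELECT name, screen_name
FROM source.user
WHERE name ILIKE '%' || $1 || '%'
   OR screen_name ILIKE '%' || $1 || '%';

-- 'evan' is a placeholder search term.
EXECUTE find_user('evan');
```

Whether the planner actually chooses the trigram index depends on table size and on the term. Very short terms yield no usable trigrams, so it is worth checking the plan with EXPLAIN on your own data.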
