Postgresql – Multi language full text search using postgresql

full-text-searchpostgresqlstring-searching

I am trying to implement full text search using postgresql for some images. I am storing some information about my images into a json field in my table. This json has a tags key where I have multiple languages, each one with tags (keywords), something like this:

"tags": {
    "en": ["blue female", "red female"],
    "es": ["hembra azul", "hembra roja"]
}

At this moment I don't have a clear idea how to store the tsvector considering that I have more languages.

One initial idea was to concatenate all those tsvectors into a single one and store it on a column in my table.

The second idea would be to make a different column for each language and store the corespondent vector into that column.

Which on would be the better one?
There is maybe another better approach?

Best Answer

You should definitely use a different column per language.

The main reason is that different languages have different stop words and stemming rules, so if you index something with to_tsvector('spanish', ...), you will not always find it with a to_tsquery('english', ...) and vice versa:

SELECT to_tsvector('spanish', 'hembra azul') @@ to_tsquery('english', 'hembra');
 ?column? 
----------
 f
(1 row)

Even better would be not to create a column per language, but only GIN indexes on to_tsvector('english', (tags->'tags'->'en')) and to_tsvector('spanish', (tags->'tags'->'es')). For example:

CREATE TABLE images (
   id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
   image bytea NOT NULL,
   tags jsonb NOT NULL
);

CREATE INDEX images_tags_en_idx ON images
   (to_tsvector('english', (tags->'tags'->'en')));

CREATE INDEX images_tags_es_idx ON images
   (to_tsvector('spanish', (tags->'tags'->'es')));

Then you can use the first index with

SELECT * FROM images
WHERE to_tsvector('english', (tags->'tags'->'en'))
      @@ to_tsquery('english', 'female');