PostgreSQL – Multi-language full text search using PostgreSQL

full-text-search postgresql string-searching

I am trying to implement full text search for some images using PostgreSQL. I store some information about my images in a JSON field in my table. This JSON has a tags key that holds multiple languages, each with its own tags (keywords), something like this:

"tags": {
    "en": ["blue female", "red female"],
    "es": ["hembra azul", "hembra roja"]
}

At this moment I don't have a clear idea how to store the tsvector, considering that I have multiple languages.

One initial idea was to concatenate all those tsvectors into a single one and store it on a column in my table.

The second idea would be to create a separate column for each language and store the corresponding vector in that column.

Which one would be better?
Is there perhaps another, better approach?

Best Answer

You should definitely use a different column per language.

The main reason is that different languages have different stop words and stemming rules, so if you index something with to_tsvector('spanish', ...), you will not always find it with a to_tsquery('english', ...) and vice versa:

SELECT to_tsvector('spanish', 'hembra azul') @@ to_tsquery('english', 'hembra');
 ?column? 
----------
 f
(1 row)
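
For contrast, here is a sketch of the same check with matching configurations on both sides. The query term goes through the same Spanish stemmer as the document text, so the first statement should return true. The second statement is only a way to inspect the lexemes each configuration produces; the column aliases are made up for illustration.

```sql
-- Same text as above, but document and query both use the 'spanish' configuration
SELECT to_tsvector('spanish', 'hembra azul') @@ to_tsquery('spanish', 'hembra');

-- Compare the lexemes produced by each configuration to see why mixing them fails
SELECT to_tsvector('spanish', 'hembra azul') AS spanish_lexemes,
       to_tsvector('english', 'hembra azul') AS english_lexemes;
```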

Even better, skip the extra columns entirely and create one GIN expression index per language, on to_tsvector('english', (tags->'tags'->'en')) and to_tsvector('spanish', (tags->'tags'->'es')). For example:

CREATE TABLE images (
   id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
   image bytea NOT NULL,
   tags jsonb NOT NULL
);

CREATE INDEX images_tags_en_idx ON images USING gin
   (to_tsvector('english', (tags->'tags'->'en')));

CREATE INDEX images_tags_es_idx ON images USING gin
   (to_tsvector('spanish', (tags->'tags'->'es')));

Then you can use the first index with a query whose WHERE clause repeats the indexed expression exactly:

SELECT * FROM images
WHERE to_tsvector('english', (tags->'tags'->'en'))
      @@ to_tsquery('english', 'female');
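
As a sketch of how the rest might look, assuming the JSON from the question is stored under a top-level "tags" key in the tags column (which is what the tags->'tags'->'en' expressions imply), the following inserts one sample row and runs the Spanish counterpart of the query. The one-byte image value is a placeholder, not real data.

```sql
-- Sample row; the bytea value is a dummy placeholder
INSERT INTO images (image, tags)
VALUES (
   '\x00',
   '{"tags": {"en": ["blue female", "red female"],
              "es": ["hembra azul", "hembra roja"]}}'
);

-- Spanish search; the expression has to match the indexed expression exactly
-- for images_tags_es_idx to be considered
SELECT id FROM images
WHERE to_tsvector('spanish', (tags->'tags'->'es'))
      @@ to_tsquery('spanish', 'hembra');

-- Searching both languages at once: each branch keeps its own configuration
SELECT id FROM images
WHERE to_tsvector('english', (tags->'tags'->'en'))
      @@ to_tsquery('english', 'female')
   OR to_tsvector('spanish', (tags->'tags'->'es'))
      @@ to_tsquery('spanish', 'hembra');
```

Whether the planner actually picks the indexes depends on table size and statistics; running the queries under EXPLAIN shows what it chose.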

Related Solutions

PostgreSQL – Full Text Search With PostgreSQL

This is not really a use case for full text search, because full text search relies on parsing the text into tokens and stemming them. As you can see from keywords, '580h' is parsed as its own word, because there is no language in which '580' is a "stem" of '580h'. You'd probably be better off with regular expression matching.

Here's a query that I worked up for you:

SELECT id, title 
  FROM stickers WHERE
    (title ~* '580')
      AND
    (title ~* 'case')
ORDER BY id;
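
If the table is large, unanchored regular expression matches like these would otherwise require a sequential scan. One possible way to speed them up, not part of the answer above, is a trigram index from the pg_trgm extension, which can support ~ and ~* searches. This is a sketch that assumes the extension is installed on the server; the index name is made up.

```sql
-- Assumption: the pg_trgm contrib extension is available
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; gin_trgm_ops is the operator class provided by pg_trgm
CREATE INDEX stickers_title_trgm_idx ON stickers
   USING gin (title gin_trgm_ops);

-- Check whether the planner uses the index for the regex conditions
EXPLAIN
SELECT id, title
  FROM stickers
 WHERE title ~* '580'
   AND title ~* 'case'
 ORDER BY id;
```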

PostgreSQL – Structure for storing table data, lists, text, pictures within PostgreSQL JSON field

If you're looking for alternatives to JSON, and you just need collections of text name-value pairs, you could simply use PostgreSQL arrays:

CREATE TYPE mydoc AS (content_type text, names text[], values text[]);
CREATE TABLE documentData(id int primary key, documentName text, docs mydoc[]);
INSERT INTO documentData(id,documentName,docs) VALUES (1, 'DocumentA', '{"(Table,\"{ColA,ColB,ColC}\",\"{val1,val2,val3}\")","(Table,\"{ColA,ColB,ColC}\",\"{val4,val5,val6}\")"}'::mydoc[]);

SELECT d.id, d.documentName, dd.*
FROM documentData d
LEFT JOIN LATERAL unnest(d.docs) dd ON (true);

This doesn't offer a significant advantage over JSON, other than portability to older PostgreSQL versions, but unnest and = ANY operations are often much simpler than the JSON functions, so the data may be easier to query and manipulate, e.g.:

SELECT d.id, d.documentName, dd.content_type, pos, name, dd.values[pos] AS value
FROM documentData d
LEFT JOIN LATERAL unnest(d.docs) dd ON (true)
LEFT JOIN LATERAL unnest(dd.names) WITH ORDINALITY AS y (name, pos) ON (true)
WHERE dd.content_type = 'Table' AND name ~ '^Col[AB]';
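
The query above shows the unnest side. As a sketch of the = ANY side, the following finds documents in which at least one entry has a column named exactly 'ColB'. It uses the documentData table and mydoc type defined above; the choice of 'ColB' is just for illustration.

```sql
-- Documents containing at least one docs entry whose names array includes 'ColB'
SELECT d.id, d.documentName
FROM documentData d
WHERE EXISTS (
   SELECT 1
   FROM unnest(d.docs) AS dd
   WHERE 'ColB' = ANY (dd.names)
);
```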
