PostgreSQL – Full text search over multiple related tables: indices and performance

Tags: full-text-search, performance, postgresql, postgresql-performance

We have the following database structure:

CREATE TABLE objects (
    id       int   PRIMARY KEY,
    name     text,
    address  text
);

CREATE TABLE tasks (
    id          int  PRIMARY KEY,
    object_id   int  NOT NULL,
    actor_id    int  NOT NULL,
    description text
);

CREATE TABLE actors (
    id   int  PRIMARY KEY,
    name text
);

The user enters a whitespace-separated list of words (search terms, basically), and we have to find the tasks that satisfy the following: a task is a "match" if each search term occurs at least once in the concatenation of the task's description, the name and address of its associated object, and the name of its associated actor.
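A side note on semantics: plainto_tsquery already ANDs the individual words together, which is exactly this "every term must occur" requirement:

SELECT plainto_tsquery('english', 'foo bar');
-- result: 'foo' & 'bar'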

Now, if we are not concerned about performance, we can just do something like this (given the query "foo bar"):

SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
WHERE to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) @@
    plainto_tsquery('foo bar');

Unfortunately, we are concerned about performance. The dataset will probably be as follows (and it is expected to grow):

  • about 10000 objects
  • about 1000 actors
  • about 100000 tasks, evenly distributed across the objects

What I've considered:

Make a denormalized table like this:

CREATE TABLE task_documents (
    id  int PRIMARY KEY,
    doc tsvector
);

The field "doc" will contain the concatenation of task's description, associated object's name and address and actor's name. We will have to create an index over this field and it will be used in full text search queries. This table will be updated in update / insert triggers on tasks, actors, objects.

Drawbacks: tons of duplicated data (which I am not too concerned about), and updates to the master tables become unpredictable in terms of the number of rows touched (say, you update the name of some object, and suddenly you must update thousands of rows in task_documents).

Honestly, I don't have any more (good) ideas. It is obviously impossible to create a single index spanning three tables that would be used for the WHERE clause in the original query.

Update

Here's a sqlfiddle with the DB schema and some data. I had to make it up because we have no real data at the moment.

Best Answer

Optimization

You're on the right track.

You either need to

  1. Denormalize
  2. Cache

Caching the results

What you probably want is a MATERIALIZED VIEW. This is easy and works reasonably well.

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id,
       to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id;

Then just

SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');
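For this to actually be fast, the materialized view itself needs a GIN index on tsv. And unlike a trigger-maintained table, it only picks up changes when you refresh it (index names here are illustrative):

CREATE INDEX foo_tsv_idx ON foo USING GIN (tsv);

-- Re-run after the base tables change. CONCURRENTLY (PostgreSQL 9.4+)
-- avoids locking out readers, but requires a unique index on the view.
CREATE UNIQUE INDEX foo_id_idx ON foo (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY foo;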

Denormalizing the table

This can take a lot of forms, but you've got the right idea.
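One other form, as a sketch (not something from the answer itself): keep the combined tsvector directly on tasks as an extra column, maintained by the same kind of triggers, so the search query needs no joins at all. Names below are made up:

ALTER TABLE tasks ADD COLUMN doc tsvector;
CREATE INDEX tasks_doc_idx ON tasks USING GIN (doc);

-- doc is maintained by triggers on tasks / objects / actors, as above
SELECT id, description
FROM tasks
WHERE doc @@ plainto_tsquery('foo bar');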

Redesign

Searching everything in a fuzzy fashion like this is a losing game. Even this knock-off of Dungeons and Dragons meets Yahoo Answers has rules.

(screenshot of the site's search-syntax rules)

It becomes a lot easier to generate a query when you introduce syntax like [text] for tagging and is:answer to search just answers, rather than rebuilding Google on top of normalized tables.
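Purely as an illustration of why that helps: once a term is scoped to a single entity, each table can carry its own small expression index, and the query only has to hit that one index. (The two-argument to_tsvector form is required here so the indexed expression is immutable; the index name is made up.)

CREATE INDEX actors_name_tsv_idx
    ON actors USING GIN (to_tsvector('english', name));

-- hypothetical scoped search "actor:foo": only the actor's name is searched
SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
WHERE to_tsvector('english', a.name) @@ plainto_tsquery('english', 'foo');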