PostgreSQL – Full text search over multiple related tables: indices and performance

Tags: full-text-search, performance, postgresql, postgresql-performance

We have the following database structure:

CREATE TABLE objects (
    id       int   PRIMARY KEY,
    name     text,
    address  text
);

CREATE TABLE tasks (
    id          int  PRIMARY KEY,
    object_id   int  NOT NULL,
    actor_id    int  NOT NULL,
    description text
);

CREATE TABLE actors (
    id   int  PRIMARY KEY,
    name text
);

The user enters a whitespace-separated list of words (search terms, basically), and we have to find the tasks that satisfy the following: a task is a "match" if each search term occurs at least once in the concatenation of the task's description, the name and address of its associated object, and the name of its associated actor.
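A side note on semantics: plainto_tsquery already ANDs the individual words together, which is exactly this "every term must occur" requirement:

SELECT plainto_tsquery('english', 'foo bar');
-- result: 'foo' & 'bar'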

Now, if we are not concerned about performance, we can just do something like this (given the query "foo bar"):

SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
WHERE to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) @@
    plainto_tsquery('foo bar');

Unfortunately, we are concerned about performance. The dataset will probably be as follows (and it is expected to grow):

  • about 10000 objects
  • about 1000 actors
  • about 100000 tasks, evenly distributed across the objects

What I've considered:

Make a denormalized table like this:

CREATE TABLE task_documents (
    id  int PRIMARY KEY,
    doc tsvector
);

The field "doc" will contain the concatenation of task's description, associated object's name and address and actor's name. We will have to create an index over this field and it will be used in full text search queries. This table will be updated in update / insert triggers on tasks, actors, objects.

Drawbacks: tons of duplicated data (which I am not too concerned about), and updates to the master tables become unpredictable in terms of the number of rows touched (say, you update the name of some object, and suddenly you must update thousands of rows in task_documents).

Honestly, I don't have any more (good) ideas. It is obviously impossible to create a single index spanning three tables that would be used for the WHERE clause in the original query.

Update

Here's a sqlfiddle with the DB schema and some data. I had to make it up because we have no real data at the moment.

Best Answer

Optimization

You're on the right track.

You either need to

  1. Denormalize
  2. Cache

Caching the results

What you probably want is a MATERIALIZED VIEW. This is easy and works reasonably well.

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id,
       to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id;

Then just

SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');
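For this to actually be fast, the materialized view itself needs a GIN index on tsv. And unlike a trigger-maintained table, it only picks up changes when you refresh it (index names here are illustrative):

CREATE INDEX foo_tsv_idx ON foo USING GIN (tsv);

-- Re-run after the base tables change. CONCURRENTLY (PostgreSQL 9.4+)
-- avoids locking out readers, but requires a unique index on the view.
CREATE UNIQUE INDEX foo_id_idx ON foo (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY foo;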

Denormalizing the table

This can take a lot of forms, but you've got the right idea.
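One other form, as a sketch (not something from the answer itself): keep the combined tsvector directly on tasks as an extra column, maintained by the same kind of triggers, so the search query needs no joins at all. Names below are made up:

ALTER TABLE tasks ADD COLUMN doc tsvector;
CREATE INDEX tasks_doc_idx ON tasks USING GIN (doc);

-- doc is maintained by triggers on tasks / objects / actors, as above
SELECT id, description
FROM tasks
WHERE doc @@ plainto_tsquery('foo bar');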

Redesign

Searching everything in a fuzzy fashion like this is a losing game. Even this knock-off of Dungeons and Dragons meets Yahoo Answers has rules.

(screenshot of the site's search-syntax rules)

It becomes a lot easier to generate a query when you introduce syntax like [text] for tagging and is:answer to search just answers, rather than rebuilding Google on top of normalized tables.
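Purely as an illustration of why that helps: once a term is scoped to a single entity, each table can carry its own small expression index, and the query only has to hit that one index. (The two-argument to_tsvector form is required here so the indexed expression is immutable; the index name is made up.)

CREATE INDEX actors_name_tsv_idx
    ON actors USING GIN (to_tsvector('english', name));

-- hypothetical scoped search "actor:foo": only the actor's name is searched
SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
WHERE to_tsvector('english', a.name) @@ plainto_tsquery('english', 'foo');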