Postgresql – Full text search on multiple joined tables

full-text-searchjoin;postgresql

I'm trying to set up a full text search in PostgreSQL but coming across the problem that the more tables I join to the query the fewer the results I get back.

The tables I am trying to join are:

  • article
  • keyword
  • author
  • organisation

There is a many-to-many relationship between article and all the other tables.

I am basing my query on that found here and at a loss as to why it's failing. The query so far looks like this:

SELECT title
FROM (
  SELECT article_title AS title, 
  to_tsvector(article.article_title) ||
  to_tsvector(string_agg(keyword.keyword, '')) ||
  to_tsvector(string_agg(organisation.org_name, '')) ||
  to_tsvector(string_agg(author.name, ''))
  AS document
  FROM article
  JOIN article_keywords on article.id = article_keywords.article_id
  JOIN keyword on keyword.id = article_keywords.keyword_id
  JOIN article_authors on article.id = article_authors.article_id
  JOIN author on author.id = article_authors.author_id
  JOIN article_organisations on article.id = article_organisations.article_id
  JOIN organisation on organisation.id = article_organisations.organisation_id
  GROUP BY article.article_title) AS p_search
WHERE p_search.document @@ to_tsquery('David & Feeney');

In the example above, I want to search across all tables for mentions of 'David Feeney'. There are two entries in the author table for him but so far none are returned.

Best Answer

The blog post you refer to is wrong in using [INNER] JOIN between post and posts_tags. Nothing enforces that at least one tag exists, so posts without tags are excluded from the result.
(Also, why is there no PK for posts_tags? A dubious DB design.)

You copied that error and made it much worse by combining multiple relationships that way. Each many-to-many relationship allows for the possibility of no (zero) related rows. If you join multiple tables that way, only such rows from article remain that have relatives in each and every of the joined tables.

On the other hand, you multiply rows with multiple relatives in multiple tables, resulting in a potentially huge intermediary table with lots and lots of redundant copies in the aggregated result. Detailed explanation:

This query should work with Postgres 9.1 or later:

SELECT a.article_title AS title
FROM   article a
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.keyword, ' ') AS txt
   FROM   article_keywords x
   JOIN   keyword          y ON y.id = x.keyword_id
   GROUP  BY 1
   ) k USING (id)
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.name, ' ') AS txt
   FROM   article_authors x
   JOIN   author          y ON y.id = x.author_id
   GROUP  BY 1
   ) a USING (id)
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.org_name, ' ') AS txt
   FROM   article_organisations x
   JOIN   organisation          y ON y.id = x.organisation_id
   GROUP  BY 1
   ) o USING (id)
WHERE  to_tsvector(concat_ws(' ', a.article_title, k.txt, a.txt, o.txt))
    @@ to_tsquery('David & Feeney');

Using LEFT [OUTER] JOIN we don't exclude rows without relatives in the joined tables.

By aggregating multiple rows before joining we avoid the cross-multiplication of rows.

I concatenate the string with concat_ws() (which ignores potential NULL values from any of the contributing sources) and call to_tsvector() only once, which should be a bit cheaper.

Besides correctly aggregating input from all related tables, this should also be among the fastest possible solutions. Aggregate first, join later.

That is - as long as you want to combine all words to search for David & Feeney. For instance, this way you get a match for an article associated with any organization that has the word 'Feeney' in its name and any author with 'David' in his / her name. That's what your query suggests. It's typically not what you'd want, though.

I would not concatenate everything for this purpose, but rather search in each table separately for the combination of both words ...

Also adding separators between word to avoid false positives from matching illegally combined words, like @Daniel hinted in his comment.

And I remove the subquery that's not needed.