I'm trying to set up full-text search in PostgreSQL, but I've run into the problem that the more tables I join to the query, the fewer results I get back.
The tables I am trying to join are:
- article
- keyword
- author
- organisation
There is a many-to-many relationship between article
and all the other tables.
I am basing my query on the one found here and am at a loss as to why it's failing. The query so far looks like this:
SELECT title
FROM (
    SELECT article_title AS title,
           to_tsvector(article.article_title) ||
           to_tsvector(string_agg(keyword.keyword, '')) ||
           to_tsvector(string_agg(organisation.org_name, '')) ||
           to_tsvector(string_agg(author.name, ''))
           AS document
    FROM article
    JOIN article_keywords ON article.id = article_keywords.article_id
    JOIN keyword ON keyword.id = article_keywords.keyword_id
    JOIN article_authors ON article.id = article_authors.article_id
    JOIN author ON author.id = article_authors.author_id
    JOIN article_organisations ON article.id = article_organisations.article_id
    JOIN organisation ON organisation.id = article_organisations.organisation_id
    GROUP BY article.article_title) AS p_search
WHERE p_search.document @@ to_tsquery('David & Feeney');
In the example above, I want to search across all tables for mentions of 'David Feeney'. There are two entries in the author table for him, but so far none are returned.
Best Answer
The blog post you refer to is wrong in using [INNER] JOIN between post and posts_tags. Nothing enforces that at least one tag exists, so posts without tags are excluded from the result. (Also, why is there no PK for posts_tags? A dubious DB design.)

You copied that error and made it much worse by combining multiple relationships that way. Each many-to-many relationship allows for the possibility of no (zero) related rows. If you join multiple tables that way, only those rows from article remain that have relatives in each and every one of the joined tables.

On the other hand, you multiply rows with multiple relatives in multiple tables, resulting in a potentially huge intermediary table with lots and lots of redundant copies in the aggregated result. Detailed explanation:
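To illustrate the multiplication (a hypothetical sketch, using the table and column names from the question): an article with 3 keywords, 2 authors and 2 organisations produces 3 × 2 × 2 = 12 intermediary rows, which you can verify with a count:

```sql
-- Counts the intermediary rows per article produced by chaining all three joins.
-- An article with 3 keywords, 2 authors and 2 organisations yields 12 rows,
-- so string_agg() then sees each keyword 4 times, each author 6 times, etc.
SELECT article.id, count(*) AS joined_rows
FROM   article
JOIN   article_keywords      ON article.id = article_keywords.article_id
JOIN   article_authors       ON article.id = article_authors.article_id
JOIN   article_organisations ON article.id = article_organisations.article_id
GROUP  BY article.id;
```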
This query should work with Postgres 9.1 or later:
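A sketch of such a query (reconstructed here from the description that follows, assuming the table and column names from the question):

```sql
SELECT a.article_title AS title
FROM   article a
LEFT   JOIN (  -- aggregate keywords per article before joining
   SELECT ak.article_id, string_agg(k.keyword, ' ') AS keywords
   FROM   article_keywords ak
   JOIN   keyword k ON k.id = ak.keyword_id
   GROUP  BY ak.article_id
   ) k  ON k.article_id = a.id
LEFT   JOIN (  -- aggregate organisations per article
   SELECT ao.article_id, string_agg(o.org_name, ' ') AS organisations
   FROM   article_organisations ao
   JOIN   organisation o ON o.id = ao.organisation_id
   GROUP  BY ao.article_id
   ) o  ON o.article_id = a.id
LEFT   JOIN (  -- aggregate authors per article
   SELECT aa.article_id, string_agg(au.name, ' ') AS authors
   FROM   article_authors aa
   JOIN   author au ON au.id = aa.author_id
   GROUP  BY aa.article_id
   ) au ON au.article_id = a.id
WHERE  to_tsvector(concat_ws(' ', a.article_title
                                , k.keywords
                                , o.organisations
                                , au.authors))
       @@ to_tsquery('David & Feeney');
```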
- Using LEFT [OUTER] JOIN, we don't exclude rows without relatives in the joined tables.
- By aggregating multiple rows before joining, we avoid the cross-multiplication of rows.
- I concatenate the strings with concat_ws() (which ignores potential NULL values from any of the contributing sources) and call to_tsvector() only once, which should be a bit cheaper.

Besides correctly aggregating input from all related tables, this should also be among the fastest possible solutions. Aggregate first, join later.
That is, as long as you want to combine all words to search for David & Feeney. For instance, this way you get a match for an article associated with any organisation that has the word 'Feeney' in its name and any author with 'David' in his/her name. That's what your query suggests, but it's typically not what you'd want. I would not concatenate everything for this purpose, but rather search each table separately for the combination of both words ...
Also adding separators between words to avoid false positives from matching illegally combined words, like @Daniel hinted in his comment.
And I remove the subquery that's not needed.