Postgresql – Full text search on multiple joined tables

full-text-searchjoin;postgresql

I'm trying to set up a full text search in PostgreSQL but coming across the problem that the more tables I join to the query the fewer the results I get back.

The tables I am trying to join are:

article
keyword
author
organisation

There is a many-to-many relationship between article and all the other tables.

I am basing my query on that found here and at a loss as to why it's failing. The query so far looks like this:

SELECT title
FROM (
  SELECT article_title AS title, 
  to_tsvector(article.article_title) ||
  to_tsvector(string_agg(keyword.keyword, '')) ||
  to_tsvector(string_agg(organisation.org_name, '')) ||
  to_tsvector(string_agg(author.name, ''))
  AS document
  FROM article
  JOIN article_keywords on article.id = article_keywords.article_id
  JOIN keyword on keyword.id = article_keywords.keyword_id
  JOIN article_authors on article.id = article_authors.article_id
  JOIN author on author.id = article_authors.author_id
  JOIN article_organisations on article.id = article_organisations.article_id
  JOIN organisation on organisation.id = article_organisations.organisation_id
  GROUP BY article.article_title) AS p_search
WHERE p_search.document @@ to_tsquery('David & Feeney');

In the example above, I want to search across all tables for mentions of 'David Feeney'. There are two entries in the author table for him but so far none are returned.

Best Answer

The blog post you refer to is wrong in using [INNER] JOIN between post and posts_tags. Nothing enforces that at least one tag exists, so posts without tags are excluded from the result.
(Also, why is there no PK for posts_tags? A dubious DB design.)

You copied that error and made it much worse by combining multiple relationships that way. Each many-to-many relationship allows for the possibility of no (zero) related rows. If you join multiple tables that way, only such rows from article remain that have relatives in each and every of the joined tables.

On the other hand, you multiply rows with multiple relatives in multiple tables, resulting in a potentially huge intermediary table with lots and lots of redundant copies in the aggregated result. Detailed explanation:

Two SQL LEFT JOINS produce incorrect result

This query should work with Postgres 9.1 or later:

SELECT a.article_title AS title
FROM   article a
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.keyword, ' ') AS txt
   FROM   article_keywords x
   JOIN   keyword          y ON y.id = x.keyword_id
   GROUP  BY 1
   ) k USING (id)
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.name, ' ') AS txt
   FROM   article_authors x
   JOIN   author          y ON y.id = x.author_id
   GROUP  BY 1
   ) a USING (id)
LEFT   JOIN (
   SELECT x.article_id AS id, string_agg(y.org_name, ' ') AS txt
   FROM   article_organisations x
   JOIN   organisation          y ON y.id = x.organisation_id
   GROUP  BY 1
   ) o USING (id)
WHERE  to_tsvector(concat_ws(' ', a.article_title, k.txt, a.txt, o.txt))
    @@ to_tsquery('David & Feeney');

Using LEFT [OUTER] JOIN we don't exclude rows without relatives in the joined tables.

By aggregating multiple rows before joining we avoid the cross-multiplication of rows.

I concatenate the string with concat_ws() (which ignores potential NULL values from any of the contributing sources) and call to_tsvector() only once, which should be a bit cheaper.

Besides correctly aggregating input from all related tables, this should also be among the fastest possible solutions. Aggregate first, join later.

That is - as long as you want to combine all words to search for David & Feeney. For instance, this way you get a match for an article associated with any organization that has the word 'Feeney' in its name and any author with 'David' in his / her name. That's what your query suggests. It's typically not what you'd want, though.

I would not concatenate everything for this purpose, but rather search in each table separately for the combination of both words ...

Also adding separators between word to avoid false positives from matching illegally combined words, like @Daniel hinted in his comment.

And I remove the subquery that's not needed.

Related Solutions

Sql-server – Full-text search in SQL Server on multiple tables fails

Hi we can query multiple tables in a full-text search by using views ,but we some advantages and disadvantages . Here in your case you created view dbo.vw_recipe_search it seems your are not using it in altered procedure when i query it i am getting results as per intended

  SELECT *
        FROM dbo.vw_recipe_search r 
        INNER JOIN FREETEXTTABLE(vw_recipe_search, *, 'salsa peas') kt
            ON r.search_id = kt.[KEY]
        ORDER BY RANK DESC;

The problem in this approach is ranking and weightage of each column , like suppose if u want to give high weight to title column than descrip you have to do a lot of work in order by rank column . We used Levenstein string comparision function to order for better results set

Sql-server – BM25 (full-text search) implementation in SQL Server

After a few days of fiddling around, I finally managed to get this done. Here is the code I ended up with, packed into a stored procedure. For this to work, I added another table, which just holds a document ID and its length in words.

ALTER PROCEDURE [dbo].[BuildIndexBM25]
-- default parameters K1=> [1.2 - 2.0], B => [0.0 - 1.0], Weight => 3 (e.g. titles only)
    @K1 FLOAT = 1.2,
    @B FLOAT = 0.75,
    @Weight FLOAT = 3
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @N FLOAT;
    DECLARE @AVGDL FLOAT;

    -- number of all documents
    SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;

    -- average document length (in words)
    SELECT @AVGDL = AVG(Length) FROM DocumentLength; 

    -- BM25 implementation
    -- result set: | DocumentA | DocumentB | BM25 |
    SELECT 
        DL.ID AS DocumentA,
        BM.DocumentID AS DocumentB, 
        BM.BM25 AS BM25
    FROM (
        SELECT ID FROM DocumentLength
    ) DL
    CROSS APPLY (
        SELECT
                Q.DocumentID,
                SUM(LOG((@N - X.Count + 0.5) / (X.Count + 0.5)) * (Q.Count * (@K1 + 1)) / (Q.Count + @K1 * (1 - @B + (@B * L.Length / @AVGDL)))) AS BM25
        FROM    TF D, TF Q, DF X, DocumentLength L
        WHERE   D.TermID = Q.TermID
            AND D.DocumentID = DL.ID
            AND Q.TermID = X.ID
            AND D.DocumentID = L.ID
            AND Q.Weight = @Weight 
            AND D.Weight = @Weight
        GROUP BY Q.DocumentID
    ) BM
    WHERE   DL.ID != BM.DocumentID
END

I hope this answer helps anyone who wanted to implement a full-text search algorithm by themselves. The default algorithm in SQL Server 2008 R2 has only minor differences to this implementation.

Best Answer

Related Solutions

Sql-server – Full-text search in SQL Server on multiple tables fails

Sql-server – BM25 (full-text search) implementation in SQL Server

Related Question