Sql-server – BM25 (full-text search) implementation in SQL Server

full-text-searchsql serversql-server-2008-r2

I've hit a stump, while trying to implement the BM25 algorithm in SQL Server 2008 R2. I know that SQL Server includes the Full-Text Search option, which already implements a variant of BM25, but I would like to do some parameter tuning tests and since the FTS procedures are non-editable (as far as I know), I've decided to implement it myself.

I have two tables, TF (term frequency) and DF (document frequency) with the following structures:

TF

*Note: the weight column denotes importance of the word (it is usually 1)

ID | Term | DocumentID | Count | TermID | Weight*

DF

ID | Term | Count

The TF table contains the relationship between a term and a document; that is, the frequency of the term in a document. The DF table contains information about how many documents contain a term. The two tables can be linked using DF.ID and TF.TermID. Using these two tables I would now like to compute BM25 similarity values between two documents (one document acts as a query) according to the formula in the Wikipedia article. The tables TF and DF translate to functions f(q, D) and n(q) respectively:

enter image description here

enter image description here

I would like the result to be in this format:

DocumentA_ID | DocumentB_ID | BM25_Value

Here is some code that I have up to now:

DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
DECLARE @K1 FLOAT;
DECLARE @B FLOAT;
SET @K1 = 1.2;
SET @B = 0.75;

-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;

-- average document length (in words)
SELECT @AVGDL = AVG(DocumentLength) FROM (SELECT DocumentID, SUM(TF.Count) AS DocumentLength FROM TF WHERE Weight = 3 GROUP BY DocumentID) A;


-- BM25 implementation
SELECT  D.DocumentID AS DocumentA, 
        Q.DocumentID AS DocumentB, 
        *, 
-- need help here (SUM or something ...)
        LOG((@N - DF.Count + 0.5)/(DF.Count + 0.5)) AS IDF
FROM TF AS D
        INNER JOIN TF AS Q ON D.Term = Q.Term 
        INNER JOIN DF ON D.TermID = DF.ID
WHERE   D.DocumentID <> Q.DocumentID 

I'm having trouble constructing the query in the last section (BM25 implementation) to get the desired result format. Any help would be greatly appreciated.

Best Answer

After a few days of fiddling around, I finally managed to get this done. Here is the code I ended up with, packed into a stored procedure. For this to work, I added another table, which just holds a document ID and its length in words.

ALTER PROCEDURE [dbo].[BuildIndexBM25]
-- default parameters K1=> [1.2 - 2.0], B => [0.0 - 1.0], Weight => 3 (e.g. titles only)
    @K1 FLOAT = 1.2,
    @B FLOAT = 0.75,
    @Weight FLOAT = 3
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @N FLOAT;
    DECLARE @AVGDL FLOAT;

    -- number of all documents
    SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;

    -- average document length (in words)
    SELECT @AVGDL = AVG(Length) FROM DocumentLength; 

    -- BM25 implementation
    -- result set: | DocumentA | DocumentB | BM25 |
    SELECT 
        DL.ID AS DocumentA,
        BM.DocumentID AS DocumentB, 
        BM.BM25 AS BM25
    FROM (
        SELECT ID FROM DocumentLength
    ) DL
    CROSS APPLY (
        SELECT
                Q.DocumentID,
                SUM(LOG((@N - X.Count + 0.5) / (X.Count + 0.5)) * (Q.Count * (@K1 + 1)) / (Q.Count + @K1 * (1 - @B + (@B * L.Length / @AVGDL)))) AS BM25
        FROM    TF D, TF Q, DF X, DocumentLength L
        WHERE   D.TermID = Q.TermID
            AND D.DocumentID = DL.ID
            AND Q.TermID = X.ID
            AND D.DocumentID = L.ID
            AND Q.Weight = @Weight 
            AND D.Weight = @Weight
        GROUP BY Q.DocumentID
    ) BM
    WHERE   DL.ID != BM.DocumentID
END

I hope this answer helps anyone who wanted to implement a full-text search algorithm by themselves. The default algorithm in SQL Server 2008 R2 has only minor differences to this implementation.