I've hit a stump, while trying to implement the BM25 algorithm in SQL Server 2008 R2. I know that SQL Server includes the Full-Text Search option, which already implements a variant of BM25, but I would like to do some parameter tuning tests and since the FTS procedures are non-editable (as far as I know), I've decided to implement it myself.
I have two tables, TF (term frequency) and DF (document frequency) with the following structures:
TF
*Note: the weight column denotes importance of the word (it is usually 1)
ID | Term | DocumentID | Count | TermID | Weight*
DF
ID | Term | Count
The TF table contains the relationship between a term and a document; that is, the frequency of the term in a document. The DF table contains information about how many documents contain a term. The two tables can be linked using DF.ID and TF.TermID. Using these two tables I would now like to compute BM25 similarity values between two documents (one document acts as a query) according to the formula in the Wikipedia article. The tables TF and DF translate to functions f(q, D) and n(q) respectively:
I would like the result to be in this format:
DocumentA_ID | DocumentB_ID | BM25_Value
Here is some code that I have up to now:
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
DECLARE @K1 FLOAT;
DECLARE @B FLOAT;
SET @K1 = 1.2;
SET @B = 0.75;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(DocumentLength) FROM (SELECT DocumentID, SUM(TF.Count) AS DocumentLength FROM TF WHERE Weight = 3 GROUP BY DocumentID) A;
-- BM25 implementation
SELECT D.DocumentID AS DocumentA,
Q.DocumentID AS DocumentB,
*,
-- need help here (SUM or something ...)
LOG((@N - DF.Count + 0.5)/(DF.Count + 0.5)) AS IDF
FROM TF AS D
INNER JOIN TF AS Q ON D.Term = Q.Term
INNER JOIN DF ON D.TermID = DF.ID
WHERE D.DocumentID <> Q.DocumentID
I'm having trouble constructing the query in the last section (BM25 implementation) to get the desired result format. Any help would be greatly appreciated.
Best Answer
After a few days of fiddling around, I finally managed to get this done. Here is the code I ended up with, packed into a stored procedure. For this to work, I added another table, which just holds a document ID and its length in words.
I hope this answer helps anyone who wanted to implement a full-text search algorithm by themselves. The default algorithm in SQL Server 2008 R2 has only minor differences to this implementation.