Postgresql – Fuzzy Matching with Postgresql 9.3

full-text-searchpostgresql

Right now I am using ts_rank (with ts_vector and ts_query) to search strings for relevance. While this is doing a great job for me so far. I was hoping to work in a bit of fuzzieness into the searching.

As it stands the search is done by a user entering their terms as a string, that I then parse into tokens. These tokens are then compared against a string. What I would like to do is use something similar to a levenshtein in these queries to allow for minor misspellings to be matched as well.

For example if the user typed in rubico instead of rubicon I would still get a match on it (because rubico only has a distance of 1 from rubicon).

Is this possible to do with Postgres' ts_rank functionality, or are there other options to allow for text searches to work with levenstheins?

Best Answer

Instead of matching the portion of the string that I wanted to fuzzy search on I instead, concatinated all of the terms into a string, and then ran the difference function (from the fuzzystrmatch module) against it. So it looked something like this.

FOR term IN terms LOOP
    term_string := term_string||' '||term;
END LOOP;

score := score + difference(term_string,string_to_compare_against);

I did need to set a threshold that needed to be passed by the score to keep from getting to many false positives.

Related Solutions

Mysql – Best way to compare strings using pre-built ngrams

I've done this in SQL Server using a modified version of Dice's Coefficient. Basically, you store a precomputed table of q-grams from your lookup data set. This table should have the primary key of the source record, the resulting q-gram, and the number of times that q-gram appears when breaking up the string.

Then when you want to do a lookup, break your input string into a similar list of q-grams (temporary table, subquery, etc.), and join those to your lookup table on the q-gram, grouping by the primary key from the lookup row (which, in the lookup table, is actually a foreign key rather than a primary key). For each matched q-gram, calculate the number of matches as the minimum cardinality between the input value and the lookup table row, e.g. the string 'aaaa' could have 'aa' * 3, and 'aaaaa' could have 'aa' * 4, thus you would have 3 matches. Double this number to get 6.

Once you've got this doubled number of matches, divide by the total number of q-grams obtained from both the input string, and the rows in the lookup table associated with a given primary key. So 'aaaa' would have 3 q-grams, and 'aaaaa' would have 4. You would calculate the similarity as 6/7. You can then filter this similarity quotient based on your desired threshold.

The nice part about this algorithm is that it's fast, since after splitting your input string, it's just a bunch of index lookups. The downside is that it doesn't put any emphasis on ordering of q-grams, so having two words out of order won't affect the match quality (assuming you're not letting q-grams cross word boundaries).

Following is some sample code from SQL Server to explain the concept, but it makes use of features not found in MySQL. You'd probably need to rewrite some of this using subqueries rather than common-table expressions.

CREATE PROCEDURE FuzzyCustomerLookup
    @name varchar(8000),
    @threshold decimal(10,4) = 0.0,
    @limit_results int = 2147483647
AS
WITH inp_q AS ( --Split input string into q-grams
    SELECT
        q.qgram,
        COUNT(*) AS cardinality
    FROM dbo.Qgrams(@name) q --Function to split string into q-grams
    GROUP BY q.qgram
),
matches AS ( --Count matches of input string against each lookup row
    SELECT
        c.row_id,
        SUM(CASE WHEN c.cardinality < i.cardinality THEN c.cardinality ELSE i.cardinality END) AS matches
    FROM CustomerQgrams c --Precomputed table of q-grams
        INNER JOIN inp_q i
            ON c.qgram = i.qgram
    GROUP BY c.row_id
),
dice AS ( --Calculate match quality
    SELECT
        matches.row_id,
        CAST(matches.matches * 2 AS decimal(10,4)) / (CAST(i.cardinality AS decimal(10,4)) + CAST(c.cardinality AS decimal(10,4))) AS similarity
    FROM matches
        INNER JOIN (
            SELECT row_id, SUM(cardinality) AS cardinality
            FROM CustomerQgrams
            GROUP BY row_id
        ) c
            ON matches.row_id = c.row_id
        CROSS JOIN (
            SELECT SUM(cardinality) AS cardinality FROM inp_q
        ) i
)
SELECT TOP (@limit_results)
    ch.CustomerNo,
    ch.CustomerName,
    dice.similarity
FROM dice
    INNER JOIN Customers ch
        ON dice.row_id = ch.row_id
WHERE dice.similarity >= @threshold
ORDER BY dice.similarity DESC

Sql-server – sql server full text search – fuzzy searching

You could try an implementation of one of the string distance functions. Here's a T-SQL implementation of the Levenshtein distance taken from here. That post does suggest there's a performance issue with the function and scalar functions generally can be bad for performance when used in queries, but worth a look. You could always try and convert it to a table function.

enter image description here

Best Answer

Related Solutions

Mysql – Best way to compare strings using pre-built ngrams

Sql-server – sql server full text search – fuzzy searching

Related Question