Mysql – Does a Levenshtein distance involve some computation for each single row

MySQLsqlitestringstring manipulation

Let's say I have, in a SQL DB, a table with 2 columns: id integer, description text and 100,000 rows. The column description has always < 20 characters.

If I want to:

SELECT * FROM mytable WHERE levenshtein_dist(description, "hello world") < 7

does this require to compute the distance for each of the 100,000 rows (heavy computation!) for any implementation of Levenshtein string similarity distance, because this is inherent to the method

is there some clever implementation or clever technique (involving an index maybe?) that can avoid to have to loop over each single row of the table?

Context: I wanted to know if MySQL (or even sqlite) had an efficient mean to find string similarity. FTS (Full text search) is one but unfortunately (it's for sqlite here, but it might be the same for MySql?):

SQLite's FTS engine is based on tokens – keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

Best Answer

PostgreSQL

Using PostgreSQL, you can create a trigram index on a column with the pgtrgm module and do a similarity search on it with the <-> operator,

text <-> text Returns the “distance” between the arguments, that is one minus the similarity() value.

note there is also word_similarity. I would highly suggest checking out the module, as it explicitly addresses your concerns,

Trigram matching is a very useful tool when used in conjunction with a full text index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.

Related Solutions

Mysql – How to use the POSIX character classes upper and lower to search text in MySQL

From the article on mysql REGEXP:

REGEXP is not case sensitive, except when used with binary strings.

A simple solution would be to use the BINARY operator:

SELECT * FROM article_body WHERE body_text REGEXP BINARY "FREE";

As to why your example doesn't work, the way I interpret it is "Any string with one or more uppercase letter followed by FREE", but of course FREE is case-insensitive there.

Mysql – Best way to compare strings using pre-built ngrams

I've done this in SQL Server using a modified version of Dice's Coefficient. Basically, you store a precomputed table of q-grams from your lookup data set. This table should have the primary key of the source record, the resulting q-gram, and the number of times that q-gram appears when breaking up the string.

Then when you want to do a lookup, break your input string into a similar list of q-grams (temporary table, subquery, etc.), and join those to your lookup table on the q-gram, grouping by the primary key from the lookup row (which, in the lookup table, is actually a foreign key rather than a primary key). For each matched q-gram, calculate the number of matches as the minimum cardinality between the input value and the lookup table row, e.g. the string 'aaaa' could have 'aa' * 3, and 'aaaaa' could have 'aa' * 4, thus you would have 3 matches. Double this number to get 6.

Once you've got this doubled number of matches, divide by the total number of q-grams obtained from both the input string, and the rows in the lookup table associated with a given primary key. So 'aaaa' would have 3 q-grams, and 'aaaaa' would have 4. You would calculate the similarity as 6/7. You can then filter this similarity quotient based on your desired threshold.

The nice part about this algorithm is that it's fast, since after splitting your input string, it's just a bunch of index lookups. The downside is that it doesn't put any emphasis on ordering of q-grams, so having two words out of order won't affect the match quality (assuming you're not letting q-grams cross word boundaries).

Following is some sample code from SQL Server to explain the concept, but it makes use of features not found in MySQL. You'd probably need to rewrite some of this using subqueries rather than common-table expressions.

CREATE PROCEDURE FuzzyCustomerLookup
    @name varchar(8000),
    @threshold decimal(10,4) = 0.0,
    @limit_results int = 2147483647
AS
WITH inp_q AS ( --Split input string into q-grams
    SELECT
        q.qgram,
        COUNT(*) AS cardinality
    FROM dbo.Qgrams(@name) q --Function to split string into q-grams
    GROUP BY q.qgram
),
matches AS ( --Count matches of input string against each lookup row
    SELECT
        c.row_id,
        SUM(CASE WHEN c.cardinality < i.cardinality THEN c.cardinality ELSE i.cardinality END) AS matches
    FROM CustomerQgrams c --Precomputed table of q-grams
        INNER JOIN inp_q i
            ON c.qgram = i.qgram
    GROUP BY c.row_id
),
dice AS ( --Calculate match quality
    SELECT
        matches.row_id,
        CAST(matches.matches * 2 AS decimal(10,4)) / (CAST(i.cardinality AS decimal(10,4)) + CAST(c.cardinality AS decimal(10,4))) AS similarity
    FROM matches
        INNER JOIN (
            SELECT row_id, SUM(cardinality) AS cardinality
            FROM CustomerQgrams
            GROUP BY row_id
        ) c
            ON matches.row_id = c.row_id
        CROSS JOIN (
            SELECT SUM(cardinality) AS cardinality FROM inp_q
        ) i
)
SELECT TOP (@limit_results)
    ch.CustomerNo,
    ch.CustomerName,
    dice.similarity
FROM dice
    INNER JOIN Customers ch
        ON dice.row_id = ch.row_id
WHERE dice.similarity >= @threshold
ORDER BY dice.similarity DESC

Best Answer

PostgreSQL

Related Solutions

Mysql – How to use the POSIX character classes upper and lower to search text in MySQL

Mysql – Best way to compare strings using pre-built ngrams

Related Question