Mysql – Does a Levenshtein distance involve some computation for each single row

MySQLsqlitestringstring manipulation

Let's say I have, in a SQL DB, a table with 2 columns: id integer, description text and 100,000 rows. The column description has always < 20 characters.

If I want to:

SELECT * FROM mytable WHERE levenshtein_dist(description, "hello world") < 7
  • does this require to compute the distance for each of the 100,000 rows (heavy computation!) for any implementation of Levenshtein string similarity distance, because this is inherent to the method

or

  • is there some clever implementation or clever technique (involving an index maybe?) that can avoid to have to loop over each single row of the table?

Context: I wanted to know if MySQL (or even sqlite) had an efficient mean to find string similarity. FTS (Full text search) is one but unfortunately (it's for sqlite here, but it might be the same for MySql?):

SQLite's FTS engine is based on tokens – keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

Best Answer

PostgreSQL

Using PostgreSQL, you can create a trigram index on a column with the pgtrgm module and do a similarity search on it with the <-> operator,

text <-> text Returns the “distance” between the arguments, that is one minus the similarity() value.

note there is also word_similarity. I would highly suggest checking out the module, as it explicitly addresses your concerns,

Trigram matching is a very useful tool when used in conjunction with a full text index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.