Mysql – Levenshtein vs MATCH vs Others for best MySQL string match

full-text-searchMySQL

I've a database with around 1.9 Million rows. My DB details

Server: Localhost via UNIX socket Server type: Percona Server Server
version: 5.5.42-37.1 – Percona Server (GPL), Release 37.1, Revision
727 Protocol version: 10 User: ****@localhost Server charset: UTF-8
Unicode (utf8)

Currenty Using:

I'm using the following algorithm on my FULLINDEX column First, I match my required string as per the below query

SELECT title FROM my_db WHERE MATCH (`Title`) AGAINST ('my string' IN BOOLEAN MODE) 

And then I use a levenstein() distance function on the results row server side through PHP to get the closest match to the string.

My Questions:

  1. Would it be faster to implement a levenshtein entirely instead of a MATCH, AGAINST on a fulltext for such a huge database?
  2. Is there any implementable algorithm which is better than levenshtein or for that matter the best such algorithm existing right now?
  3. Is there any other work around instead of MATCH or levenshtein() ?
  4. How will each the below search modifiers

IN NATURAL LANGUAGE MODE,
IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION,
IN BOOLEAN MODE,
WITH QUERY EXPANSION,

enhance or optimize my searching and results? Hence, which one would be the best?

Thanks!

Best Answer

A FULLTEXT index is very efficient for small resultsets in a huge table. Using levenshtein involves checking each row. So, I agree with your approach to do FT as a first step.

I would suggest

  1. Remove short words, etc, from the string. (not necessary for MyISAM since it ignores them; necessary for InnoDB, else it returns nothing)
  2. IN BOOLEAN MODE but without any "+" on the words;
  3. ORDER BY MATCH... DESC and add LIMIT. This keeps edge cases from coming up with thousands of rows for the next step.
  4. Check levenshtein distance.

Keep in mind that the end result will be less than perfect, but at least it should be "fast enough".