I've a database with around 1.9 Million rows. My DB details
Server: Localhost via UNIX socket Server type: Percona Server Server
version: 5.5.42-37.1 – Percona Server (GPL), Release 37.1, Revision
727 Protocol version: 10 User: ****@localhost Server charset: UTF-8
Unicode (utf8)
Currenty Using:
I'm using the following algorithm on my FULLINDEX column First, I match my required string as per the below query
SELECT title FROM my_db WHERE MATCH (`Title`) AGAINST ('my string' IN BOOLEAN MODE)
And then I use a levenstein()
distance function on the results row server side through PHP to get the closest match to the string.
My Questions:
- Would it be faster to implement a levenshtein entirely instead of a MATCH, AGAINST on a fulltext for such a huge database?
- Is there any implementable algorithm which is better than levenshtein or for that matter the best such algorithm existing right now?
- Is there any other work around instead of
MATCH
orlevenshtein()
? - How will each the below search modifiers
IN NATURAL LANGUAGE MODE,
IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION,
IN BOOLEAN MODE,
WITH QUERY EXPANSION,
enhance or optimize my searching and results? Hence, which one would be the best?
Thanks!
Best Answer
A
FULLTEXT
index is very efficient for small resultsets in a huge table. Using levenshtein involves checking each row. So, I agree with your approach to do FT as a first step.I would suggest
IN BOOLEAN MODE
but without any "+" on the words;ORDER BY MATCH... DESC
and addLIMIT
. This keeps edge cases from coming up with thousands of rows for the next step.levenshtein
distance.Keep in mind that the end result will be less than perfect, but at least it should be "fast enough".