MySQL – Boolean Full-text Search

full-text-searchMySQL

I'm using MySQL's boolean full-text search for my search engine: https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html.

Assume I have a table with the following columns:

PRODUCTS
--------
id (PK)
color_id (FK to COLORS table)
name
description

When given a search term, I want to search the product's table's name and description columns, as well as the product's color name. The color's name is in a separate table.

Should I create a new column in the product's table and put all the text I want indexed in there? For example, I add a products.full_text_index column, and dump a concatenation of name, description, and the color's name in that column? Then do full-text search against that column?

UPDATE: Or, instead of a column in an existing table, maybe create a new table to contain all the text that should be used for the full-text search. That table can just have a FK to the products table.

Best Answer

Yes, that is a good way to use FULLTEXT, especially if you need to 'cleanse' the data in any way, or gather text from related tables.

There is probably not much difference between having the new column in the same table or in a separate table.

The redundancy is wasteful. However, there is no way to have any kind (FULLTEXT or other) of index that spans multiple tables. Might it be better to have all the columns in the same table, and do FULLTEXT(this, that, the_other)?

And the redundancy is potentially error-prone. This could be tempered by having a Stored Procedure that users must call when inserting or manipulating any of the text involved in the issue. This SP would always make the copy for you.

When using a FT index with other things, such as

WHERE price > 50
  AND MATCH(stuff) AGAINST(...)

the FT test will be done first. Hopefully, that will select only a small number of rows. Then the other tests will be applied to further filter the results. When the situation is reversed (price is more selective than MATCH), well, too bad.

If you always have the MATCH, there is no need for indexing price, the index won't be used.

Related Solutions

Sql-server – get similar Full Text Search functionality without using Full-Text Search

Native full-text can handle all of the features you've mentioned.

Ranking/relevance, weighting
Word inflection and expansion
Customisable noise words and thesaurus

Whether or not it will be the best solution for your particular domain is a tough call. Rather than attempt to hand-roll an alternative, spike a proof of concept using SQL FTS and repeat the exercise with one or more open source alternatives.

Lucene is the obvious contender to compare and comes with the "Stack Overflow search is now 81% less crappy" endorsement of the SO team.

Mysql – Why full-text-search returns less rows than LIKE

I took the three strings in your question and added it to a table plus three more string with pankt instead of punkt.

The following was executed using MySQL 5.5.12 for Windows

mysql> CREATE TABLE artikel
    -> (
    ->     id INT NOT NULL AUTO_INCREMENT,
    ->     meldungstext MEDIUMTEXT,
    ->     PRIMARY KEY (id),
    ->     FULLTEXT (meldungstext)
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO artikel (meldungstext) VALUES
    -> ('Punkten'),('Zwei-Punkte-Vorsprung'),('Treffpunkt'),
    -> ('Pankten'),('Zwei-Pankte-Vorsprung'),('Treffpankt');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql>

I ran these queries against the table using 3 different approaches

MATCH ... AGAINST
LOCATE as in the LOCATE function
LIKE

Please note the differences

mysql> SELECT id,meldungstext,
    -> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE),1,0)) PunktMatch,
    -> IF(LOCATE('punkt',meldungstext)>0,1,0) PunktLocate,
    -> meldungstext  LIKE '%punkt%' PunktLike
    -> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PunktMatch | PunktLocate | PunktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          1 |           1 |         1 |
|  2 | Zwei-Punkte-Vorsprung |          1 |           1 |         1 |
|  3 | Treffpunkt            |          1 |           1 |         1 |
|  4 | Pankten               |          1 |           0 |         0 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           0 |         0 |
|  6 | Treffpankt            |          1 |           0 |         0 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)

mysql>

All the PunktMatch values should bee 3 1's and 3 0's.

Now watch me query them as normal

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE);
+-----------------------+
| meldungstext          |
+-----------------------+
| Zwei-Punkte-Vorsprung |
| Punkten               |
+-----------------------+
2 rows in set (0.01 sec)

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE LOCATE('punkt',meldungstext)>0;
+-----------------------+
| meldungstext          |
+-----------------------+
| Punkten               |
| Zwei-Punkte-Vorsprung |
| Treffpunkt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE `meldungstext` LIKE '%punk%';
+-----------------------+
| meldungstext          |
+-----------------------+
| Punkten               |
| Zwei-Punkte-Vorsprung |
| Treffpunkt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql>

OK using MATCH .. AGAINST with punkt does not work. What about pankt ???

mysql> SELECT `meldungstext` FROM `artikel` WHERE `meldungstext` LIKE '%pankt%';
+-----------------------+
| meldungstext          |
+-----------------------+
| Pankten               |
| Zwei-Pankte-Vorsprung |
| Treffpankt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql>

Let's run my big GROUP BY query against pankt

mysql> SELECT id,meldungstext,
    -> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*pankt*' IN BOOLEAN MODE),1,0)) PanktMatch,
    -> IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate,
    -> meldungstext  LIKE '%pankt%' PanktLike
    -> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          1 |           0 |         0 |
|  2 | Zwei-Punkte-Vorsprung |          1 |           0 |         0 |
|  3 | Treffpunkt            |          1 |           0 |         0 |
|  4 | Pankten               |          1 |           1 |         1 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           1 |         1 |
|  6 | Treffpankt            |          1 |           1 |         1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)

mysql>

This is wrong also because I should see 3 0's and 3 1's for PanktMatch.

I tried something else

mysql> SELECT id,meldungstext, MATCH (`meldungstext`) AGAINST ('+*pankt*' IN BOOLEAN MODE) PanktMatch, IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate, meldungstext  LIKE '%pankt%' PanktLike FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          0 |           0 |         0 |
|  2 | Zwei-Punkte-Vorsprung |          0 |           0 |         0 |
|  3 | Treffpunkt            |          0 |           0 |         0 |
|  4 | Pankten               |          1 |           1 |         1 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           1 |         1 |
|  6 | Treffpankt            |          0 |           1 |         1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.00 sec)

mysql>

I added a plus sign to pankt and I got different results. What 2 and not 3 ???

According to the MySQL Documentation, notice what it says about the wildcard character:

*

The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.

If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short (as determined from the ft_min_word_len setting) or a stopword. This occurs because the word is not seen as too short or a stopword, but as a prefix that must be present in the document in the form of a word that begins with the prefix. Suppose that ft_min_word_len=4. Then a search for '+word +the*' will likely return fewer rows than a search for '+word +the':

The former query remains as is and requires both word and the* (a word starting with the) to be present in the document.

The latter query is transformed to +word (requiring only word to be present). the is both too short and a stopword, and either condition is enough to cause it to be ignored.

Based on this, the wildcard character is applicable for the back of tokens and not for the front. In light of this, the output must be correct because 2 of the 3 punkt's start tokens. Same story with pankt. This at least explains why 2 out of 3 and why less rows.

Best Answer

Related Solutions

Sql-server – get similar Full Text Search functionality without using Full-Text Search

Mysql – Why full-text-search returns less rows than LIKE

Related Question