I took the three strings in your question and added it to a table plus three more string with pankt
instead of punkt
.
The following was executed using MySQL 5.5.12 for Windows
mysql> CREATE TABLE artikel
-> (
-> id INT NOT NULL AUTO_INCREMENT,
-> meldungstext MEDIUMTEXT,
-> PRIMARY KEY (id),
-> FULLTEXT (meldungstext)
-> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO artikel (meldungstext) VALUES
-> ('Punkten'),('Zwei-Punkte-Vorsprung'),('Treffpunkt'),
-> ('Pankten'),('Zwei-Pankte-Vorsprung'),('Treffpankt');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql>
I ran these queries against the table using 3 different approaches
MATCH ... AGAINST
LOCATE
as in the LOCATE function
LIKE
Please note the differences
mysql> SELECT id,meldungstext,
-> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE),1,0)) PunktMatch,
-> IF(LOCATE('punkt',meldungstext)>0,1,0) PunktLocate,
-> meldungstext LIKE '%punkt%' PunktLike
-> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext | PunktMatch | PunktLocate | PunktLike |
+----+-----------------------+------------+-------------+-----------+
| 1 | Punkten | 1 | 1 | 1 |
| 2 | Zwei-Punkte-Vorsprung | 1 | 1 | 1 |
| 3 | Treffpunkt | 1 | 1 | 1 |
| 4 | Pankten | 1 | 0 | 0 |
| 5 | Zwei-Pankte-Vorsprung | 1 | 0 | 0 |
| 6 | Treffpankt | 1 | 0 | 0 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)
mysql>
All the PunktMatch values should bee 3 1's and 3 0's.
Now watch me query them as normal
mysql> SELECT `meldungstext` FROM `artikel`
-> WHERE MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE);
+-----------------------+
| meldungstext |
+-----------------------+
| Zwei-Punkte-Vorsprung |
| Punkten |
+-----------------------+
2 rows in set (0.01 sec)
mysql> SELECT `meldungstext` FROM `artikel`
-> WHERE LOCATE('punkt',meldungstext)>0;
+-----------------------+
| meldungstext |
+-----------------------+
| Punkten |
| Zwei-Punkte-Vorsprung |
| Treffpunkt |
+-----------------------+
3 rows in set (0.00 sec)
mysql> SELECT `meldungstext` FROM `artikel`
-> WHERE `meldungstext` LIKE '%punk%';
+-----------------------+
| meldungstext |
+-----------------------+
| Punkten |
| Zwei-Punkte-Vorsprung |
| Treffpunkt |
+-----------------------+
3 rows in set (0.00 sec)
mysql>
OK using MATCH .. AGAINST with punkt does not work. What about pankt ???
mysql> SELECT `meldungstext` FROM `artikel` WHERE `meldungstext` LIKE '%pankt%';
+-----------------------+
| meldungstext |
+-----------------------+
| Pankten |
| Zwei-Pankte-Vorsprung |
| Treffpankt |
+-----------------------+
3 rows in set (0.00 sec)
mysql>
Let's run my big GROUP BY
query against pankt
mysql> SELECT id,meldungstext,
-> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*pankt*' IN BOOLEAN MODE),1,0)) PanktMatch,
-> IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate,
-> meldungstext LIKE '%pankt%' PanktLike
-> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
| 1 | Punkten | 1 | 0 | 0 |
| 2 | Zwei-Punkte-Vorsprung | 1 | 0 | 0 |
| 3 | Treffpunkt | 1 | 0 | 0 |
| 4 | Pankten | 1 | 1 | 1 |
| 5 | Zwei-Pankte-Vorsprung | 1 | 1 | 1 |
| 6 | Treffpankt | 1 | 1 | 1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)
mysql>
This is wrong also because I should see 3 0's and 3 1's for PanktMatch.
I tried something else
mysql> SELECT id,meldungstext, MATCH (`meldungstext`) AGAINST ('+*pankt*' IN BOOLEAN MODE) PanktMatch, IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate, meldungstext LIKE '%pankt%' PanktLike FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
| 1 | Punkten | 0 | 0 | 0 |
| 2 | Zwei-Punkte-Vorsprung | 0 | 0 | 0 |
| 3 | Treffpunkt | 0 | 0 | 0 |
| 4 | Pankten | 1 | 1 | 1 |
| 5 | Zwei-Pankte-Vorsprung | 1 | 1 | 1 |
| 6 | Treffpankt | 0 | 1 | 1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.00 sec)
mysql>
I added a plus sign to pankt and I got different results. What 2 and not 3 ???
According to the MySQL Documentation, notice what it says about the wildcard character:
*
The asterisk serves as the truncation (or wildcard) operator. Unlike
the other operators, it should be appended to the word to be affected.
Words match if they begin with the word preceding the * operator.
If a word is specified with the truncation operator, it is not
stripped from a boolean query, even if it is too short (as determined
from the ft_min_word_len setting) or a stopword. This occurs because
the word is not seen as too short or a stopword, but as a prefix that
must be present in the document in the form of a word that begins with
the prefix. Suppose that ft_min_word_len=4. Then a search for '+word
+the*' will likely return fewer rows than a search for '+word +the':
The former query remains as is and requires both word and the* (a word
starting with the) to be present in the document.
The latter query is transformed to +word (requiring only word to be
present). the is both too short and a stopword, and either condition
is enough to cause it to be ignored.
Based on this, the wildcard character is applicable for the back of tokens and not for the front. In light of this, the output must be correct because 2 of the 3 punkt's start tokens. Same story with pankt. This at least explains why 2 out of 3 and why less rows.
Best Answer
Yes, that is a good way to use
FULLTEXT
, especially if you need to 'cleanse' the data in any way, or gather text from related tables.There is probably not much difference between having the new column in the same table or in a separate table.
The redundancy is wasteful. However, there is no way to have any kind (
FULLTEXT
or other) of index that spans multiple tables. Might it be better to have all the columns in the same table, and doFULLTEXT(this, that, the_other)
?And the redundancy is potentially error-prone. This could be tempered by having a Stored Procedure that users must call when inserting or manipulating any of the text involved in the issue. This SP would always make the copy for you.
When using a FT index with other things, such as
the FT test will be done first. Hopefully, that will select only a small number of rows. Then the other tests will be applied to further filter the results. When the situation is reversed (
price
is more selective thanMATCH
), well, too bad.If you always have the
MATCH
, there is no need for indexingprice
, the index won't be used.