Mysql – Why does MATCH() give Score 0, if there are less rows in the database

full-text-searchMySQL

I followed this example from the Mysql Documentation:

CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT,
       FULLTEXT (title,body)
     );

INSERT INTO articles (title,body) VALUES
     ('MySQL Tutorial','DBMS stands for DataBase ...'),
     ('How To Use MySQL Well','After you went through a ...'),
     ('Optimizing MySQL','In this tutorial we will show ...'),
     ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
     ('MySQL vs. YourSQL','In the following database comparison ...'),
     ('MySQL Security','When configured properly, MySQL ...');

SELECT *, MATCH ( title, body ) AGAINST ( 'database' ) AS Score
    FROM articles
    WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------+------------+
| id | title             | body                         | Score      |
+----+-------------------+------------------------------+------------+
|  5 | MySQL vs. YourSQL | In the following database... | 0.662664...|
|  1 | MySQL Tutorial    | DBMS stands for DataBase ... | 0.655458...|
+----+-------------------+------------------------------+------------+

this works, but if i delete all rows, but one:

DELETE FROM `articles` WHERE id >1

then the select gives Score 0:

SELECT * , MATCH ( title, body ) AGAINST ( 'database' ) AS Score FROM articles

Best Answer

With only one line left all occurances are 100% of the search, and everything above 50% is considered a stop word. My guess is that a stop word gets scored 0. There's also a stop word list (the word 'database' is not in it though) found here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html

From http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html you find:

A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match.

Related Solutions

Mysql – Are two indexes needed

An index can seek by a subset of characters, as long as you're searching from the left. E.g., "Inter%" can seek, "%net" will not.

However, the first character is not necessarily the character under which the article would be sorted. "The Internet" should go under "I", not "T". You probably need two fields, DisplayTitle and SortTitle; a single-character index on the latter may be worthwhile, but most likely a full-length index will be just fine.

Indexes are typically B-trees, and a seek will jump to the right location about equally quickly whether you have 10 or 100 entries per page. Scans are another matter, but I'd start with the simplest solution and add an extra index only if performance proves inadequate in practice.

Mysql – Why full-text-search returns less rows than LIKE

I took the three strings in your question and added it to a table plus three more string with pankt instead of punkt.

The following was executed using MySQL 5.5.12 for Windows

mysql> CREATE TABLE artikel
    -> (
    ->     id INT NOT NULL AUTO_INCREMENT,
    ->     meldungstext MEDIUMTEXT,
    ->     PRIMARY KEY (id),
    ->     FULLTEXT (meldungstext)
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO artikel (meldungstext) VALUES
    -> ('Punkten'),('Zwei-Punkte-Vorsprung'),('Treffpunkt'),
    -> ('Pankten'),('Zwei-Pankte-Vorsprung'),('Treffpankt');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql>

I ran these queries against the table using 3 different approaches

MATCH ... AGAINST
LOCATE as in the LOCATE function
LIKE

Please note the differences

mysql> SELECT id,meldungstext,
    -> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE),1,0)) PunktMatch,
    -> IF(LOCATE('punkt',meldungstext)>0,1,0) PunktLocate,
    -> meldungstext  LIKE '%punkt%' PunktLike
    -> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PunktMatch | PunktLocate | PunktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          1 |           1 |         1 |
|  2 | Zwei-Punkte-Vorsprung |          1 |           1 |         1 |
|  3 | Treffpunkt            |          1 |           1 |         1 |
|  4 | Pankten               |          1 |           0 |         0 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           0 |         0 |
|  6 | Treffpankt            |          1 |           0 |         0 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)

mysql>

All the PunktMatch values should bee 3 1's and 3 0's.

Now watch me query them as normal

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE MATCH (`meldungstext`) AGAINST ('*punkt*' IN BOOLEAN MODE);
+-----------------------+
| meldungstext          |
+-----------------------+
| Zwei-Punkte-Vorsprung |
| Punkten               |
+-----------------------+
2 rows in set (0.01 sec)

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE LOCATE('punkt',meldungstext)>0;
+-----------------------+
| meldungstext          |
+-----------------------+
| Punkten               |
| Zwei-Punkte-Vorsprung |
| Treffpunkt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql> SELECT `meldungstext` FROM `artikel`
    -> WHERE `meldungstext` LIKE '%punk%';
+-----------------------+
| meldungstext          |
+-----------------------+
| Punkten               |
| Zwei-Punkte-Vorsprung |
| Treffpunkt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql>

OK using MATCH .. AGAINST with punkt does not work. What about pankt ???

mysql> SELECT `meldungstext` FROM `artikel` WHERE `meldungstext` LIKE '%pankt%';
+-----------------------+
| meldungstext          |
+-----------------------+
| Pankten               |
| Zwei-Pankte-Vorsprung |
| Treffpankt            |
+-----------------------+
3 rows in set (0.00 sec)

mysql>

Let's run my big GROUP BY query against pankt

mysql> SELECT id,meldungstext,
    -> COUNT(IF(MATCH (`meldungstext`) AGAINST ('*pankt*' IN BOOLEAN MODE),1,0)) PanktMatch,
    -> IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate,
    -> meldungstext  LIKE '%pankt%' PanktLike
    -> FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          1 |           0 |         0 |
|  2 | Zwei-Punkte-Vorsprung |          1 |           0 |         0 |
|  3 | Treffpunkt            |          1 |           0 |         0 |
|  4 | Pankten               |          1 |           1 |         1 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           1 |         1 |
|  6 | Treffpankt            |          1 |           1 |         1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.01 sec)

mysql>

This is wrong also because I should see 3 0's and 3 1's for PanktMatch.

I tried something else

mysql> SELECT id,meldungstext, MATCH (`meldungstext`) AGAINST ('+*pankt*' IN BOOLEAN MODE) PanktMatch, IF(LOCATE('pankt',meldungstext)>0,1,0) PanktLocate, meldungstext  LIKE '%pankt%' PanktLike FROM `artikel` GROUP BY id,meldungstext;
+----+-----------------------+------------+-------------+-----------+
| id | meldungstext          | PanktMatch | PanktLocate | PanktLike |
+----+-----------------------+------------+-------------+-----------+
|  1 | Punkten               |          0 |           0 |         0 |
|  2 | Zwei-Punkte-Vorsprung |          0 |           0 |         0 |
|  3 | Treffpunkt            |          0 |           0 |         0 |
|  4 | Pankten               |          1 |           1 |         1 |
|  5 | Zwei-Pankte-Vorsprung |          1 |           1 |         1 |
|  6 | Treffpankt            |          0 |           1 |         1 |
+----+-----------------------+------------+-------------+-----------+
6 rows in set (0.00 sec)

mysql>

I added a plus sign to pankt and I got different results. What 2 and not 3 ???

According to the MySQL Documentation, notice what it says about the wildcard character:

*

The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.

If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short (as determined from the ft_min_word_len setting) or a stopword. This occurs because the word is not seen as too short or a stopword, but as a prefix that must be present in the document in the form of a word that begins with the prefix. Suppose that ft_min_word_len=4. Then a search for '+word +the*' will likely return fewer rows than a search for '+word +the':

The former query remains as is and requires both word and the* (a word starting with the) to be present in the document.

The latter query is transformed to +word (requiring only word to be present). the is both too short and a stopword, and either condition is enough to cause it to be ignored.

Based on this, the wildcard character is applicable for the back of tokens and not for the front. In light of this, the output must be correct because 2 of the 3 punkt's start tokens. Same story with pankt. This at least explains why 2 out of 3 and why less rows.

Best Answer

Related Solutions

Mysql – Are two indexes needed

Mysql – Why full-text-search returns less rows than LIKE

Related Question