Mysql – Which characters are considered “word” characters

character-setfull-text-searchMySQL

Fulltext searchin with InnoDB in MySQL only includes certain characters referred to in the docs as "word characters" and mentioned here:

For the built-in full-text parser, you can change the set of characters that are considered word characters in several ways, as described in the following…

My question is, for a given character set, how can I determine which characters are considered "word" characters? My table is utf_general_ci. I did a bit of trial and error and found for instance, that "¬" is not treated as a word character but "Ω" is. I'm looking for either a clearly defined reference or a tool of some support so that I can find this out without doing trial and error.

Best Answer

The characters considered for forming a word, is explained in the Documentation:

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as aaa'bbb.

Regarding Delimiters:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, (space), , (comma), and . (period).

Besides the above, there are Storage engine specific limits as well, on the minimum word size:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM.

Now, regarding the list of defined "True Word Characters", MySQL has given flexibilities, to be able to add/remove additional characters, for word criteria. As stated in the Documentation:

Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:

  • Modify the MySQL source: In storage/innobase/handler/ha_innodb.cc (for InnoDB), or in storage/myisam/ftdefs.h (for MyISAM), see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.

  • Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. . You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.

Now looking at the source code ha_innodb.cc here:

#define true_word_char(c, ch) ((c) & (_MY_U | _MY_L | _MY_NMR) || (ch) == '_')

For some simple character-sets like latin1, one can edit the <ctype><map> array in the respective .xml file: https://github.com/mysql/mysql-server/blob/5.7/sql/share/charsets/latin1.xml#L25

However, utf8 being a complex character-set, it is implemented in the ctype-utf8.c file . It contains the complete character set definitions. Now, array elements are bit values. Each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in include/m_ctype.h file:

#define _MY_U   01  /* Upper case */
#define _MY_L   02  /* Lower case */
#define _MY_NMR 04  /* Numeral (digit) */

So, the characters having attributes associated with above bitmasks will be considered as true word character. Check out some explanation here --> https://dev.mysql.com/doc/refman/5.7/en/character-arrays.html

Finally, there is a list of stopwords (InnoDB has limited list, but MyISAM has quite a big list), which are ignored. Eg: about, are, com etc.

Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list.