Full-text searching with InnoDB in MySQL only indexes certain characters, referred to in the docs as "word characters" and mentioned here:
For the built-in full-text parser, you can change the set of characters that are considered word characters in several ways, as described in the following…
My question is: for a given character set, how can I determine which characters are considered "word" characters? My table is utf_general_ci. I did a bit of trial and error and found, for instance, that "¬" is not treated as a word character but "Ω" is. I'm looking for either a clearly defined reference or a tool of some sort so that I can find this out without trial and error.
Best Answer
The characters considered when forming a word are explained in the documentation:
Regarding Delimiters:
Besides the above, there are storage-engine-specific limits on the minimum word size as well:
Now, regarding the list of defined "true word characters": MySQL gives you the flexibility to add or remove characters from the word criteria. As stated in the documentation:
Now looking at the source code, ha_innodb.cc, here:
For some simple character sets like latin1, one can edit the <ctype><map> array in the respective .xml file: https://github.com/mysql/mysql-server/blob/5.7/sql/share/charsets/latin1.xml#L25
However, utf8 being a complex character set, it is implemented in the ctype-utf8.c file. It contains the complete character set definitions. The array elements are bit values. Each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in the include/m_ctype.h file:
So, the characters having attributes associated with the above bitmasks will be considered true word characters. Check out some explanation here: https://dev.mysql.com/doc/refman/5.7/en/character-arrays.html
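To make the bitmask check concrete, here is a minimal Python sketch of the idea, not MySQL's actual code. It assumes the attribute values from include/m_ctype.h are _MY_U = 01, _MY_L = 02, _MY_NMR = 04, _MY_SPC = 010 and _MY_PNT = 020 (octal), and that a character counts as a true word character when it has the upper-case, lower-case or numeral bit set, or is an underscore. The names is_true_word_char, split_words and TOY_CTYPE are hypothetical and exist only for this illustration; the toy table stands in for the real per-character-set arrays.

```python
# Attribute bitmasks, assumed to match include/m_ctype.h (octal in the C header).
MY_U = 0o01    # upper-case letter
MY_L = 0o02    # lower-case letter
MY_NMR = 0o04  # numeral (digit)
MY_SPC = 0o10  # spacing character
MY_PNT = 0o20  # punctuation

# Bits that (by assumption) make a character part of a word.
WORD_MASK = MY_U | MY_L | MY_NMR


def is_true_word_char(ctype: int, ch: str) -> bool:
    """Return True if the attribute bits mark ch as a word character.

    Assumption: the underscore is accepted regardless of its attribute bits.
    """
    return bool(ctype & WORD_MASK) or ch == "_"


# Hypothetical toy ctype table, for illustration only.
TOY_CTYPE = {
    "A": MY_U,
    "a": MY_L,
    "7": MY_NMR,
    " ": MY_SPC,
    "!": MY_PNT,
    "_": MY_PNT,
}


def split_words(text: str) -> list:
    """Split text into runs of word characters using the toy table."""
    words, current = [], []
    for ch in text:
        # Characters missing from the toy table get no attributes (0).
        if is_true_word_char(TOY_CTYPE.get(ch, 0), ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

For example, split_words("a7 A!_a") treats the space and "!" as delimiters and keeps the underscore as part of the last word. A character such as "¬" would act as a delimiter here simply because the toy table gives it no letter or numeral bit; for a real character set you would have to look at the actual ctype data.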
Finally, there is a list of stopwords, which are ignored (InnoDB has a limited list, but MyISAM has quite a big one), e.g. about, are, com, etc.