Mysql – Which characters are considered “word” characters

character-set, full-text-search, MySQL

Full-text search with InnoDB in MySQL only indexes certain characters, which the docs refer to as "word characters" and mention here:

For the built-in full-text parser, you can change the set of characters that are considered word characters in several ways, as described in the following…

My question is: for a given character set, how can I determine which characters are considered "word" characters? My table uses utf8_general_ci. Through a bit of trial and error I found, for instance, that "¬" is not treated as a word character but "Ω" is. I'm looking for either a clearly defined reference or a tool of some sort so that I can find this out without trial and error.

Best Answer

The characters that can form a word are explained in the documentation:

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as aaa'bbb.

Regarding Delimiters:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, " " (space), , (comma), and . (period).
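The word and delimiter rules quoted above can be sketched as a small tokenizer. This is only an illustration, not MySQL's parser: the function names are made up for this example, and Python's str.isalnum() stands in for MySQL's character-type table, so results for non-ASCII characters may differ from what a given MySQL character set does.

```python
from typing import List


def is_true_word_char(ch: str) -> bool:
    # Assumption: str.isalnum() approximates "letters and digits";
    # MySQL consults its own per-charset ctype table instead.
    return ch.isalnum() or ch == "_"


def split_words(text: str) -> List[str]:
    """Split text into words following the documented rules (sketch only)."""
    words = []
    current = []
    prev_apostrophe = False
    for ch in text:
        if is_true_word_char(ch):
            current.append(ch)
            prev_apostrophe = False
        elif ch == "'" and current and not prev_apostrophe:
            # A single apostrophe inside a word is kept for now.
            current.append(ch)
            prev_apostrophe = True
        else:
            # A delimiter, or a second apostrophe in a row, ends the word.
            if current:
                # Apostrophes at either end of a word are stripped.
                words.append("".join(current).strip("'"))
            current = []
            prev_apostrophe = False
    if current:
        words.append("".join(current).strip("'"))
    return words


print(split_words("aaa'bbb aaa''bbb 'aaa'bbb' x-ray, c_d"))
```

The three apostrophe cases from the quoted paragraph behave as described there: aaa'bbb stays one word, aaa''bbb becomes two, and 'aaa'bbb' loses its outer apostrophes.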

Besides the above, there are also storage-engine-specific limits on the minimum word length:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM.
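As a rough illustration of the length rule, the sketch below filters a word list by the default minimums quoted above. The dictionary and function names are hypothetical; on a real server the limits come from the innodb_ft_min_token_size and ft_min_word_len system variables, which may have been changed from their defaults.

```python
# Default minimum word lengths as stated in the quoted documentation.
DEFAULT_MIN_WORD_LENGTH = {"InnoDB": 3, "MyISAM": 4}


def drop_short_words(words, engine="InnoDB"):
    """Keep only words long enough to be indexed under the default limit."""
    minimum = DEFAULT_MIN_WORD_LENGTH[engine]
    return [w for w in words if len(w) >= minimum]


sample = ["an", "the", "word", "search"]
print(drop_short_words(sample, "InnoDB"))
print(drop_short_words(sample, "MyISAM"))
```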

As for the set of "true word characters", MySQL gives you some flexibility to add or remove characters that count toward a word. As stated in the documentation:

Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:

Modify the MySQL source: In storage/innobase/handler/ha_innodb.cc (for InnoDB), or in storage/myisam/ftdefs.h (for MyISAM), see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.

Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the <ctype><map> array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.

Looking at the source code in ha_innodb.cc, the macro is defined as:

#define true_word_char(c, ch) ((c) & (_MY_U | _MY_L | _MY_NMR) || (ch) == '_')

For some simple character-sets like latin1, one can edit the <ctype><map> array in the respective .xml file: https://github.com/mysql/mysql-server/blob/5.7/sql/share/charsets/latin1.xml#L25

However, utf8 is a complex character set and is implemented in the ctype-utf8.c file, which contains the complete character set definitions. The array elements are bit values: each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in include/m_ctype.h:

#define _MY_U   01  /* Upper case */
#define _MY_L   02  /* Lower case */
#define _MY_NMR 04  /* Numeral (digit) */

So a character whose attributes include any of the above bitmasks is considered a true word character. Some further explanation is available here: https://dev.mysql.com/doc/refman/5.7/en/character-arrays.html
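To make the bitmask test concrete, here is a Python rendering of the true_word_char() macro together with a tiny made-up character-type table. The three constants are the values quoted above; toy_ctype, patched_ctype, and is_word_char are hypothetical names for illustration, and the real tables cover the whole character set rather than four characters.

```python
# Bitmask values as quoted above from include/m_ctype.h (octal literals).
_MY_U = 0o1    # upper case
_MY_L = 0o2    # lower case
_MY_NMR = 0o4  # numeral (digit)


def true_word_char(ctype_bits, ch):
    """Python rendering of the C macro true_word_char(c, ch)."""
    return bool(ctype_bits & (_MY_U | _MY_L | _MY_NMR)) or ch == "_"


# Hypothetical four-entry table. Real tables also set other attribute
# bits (for example punctuation) where this sketch simply uses 0.
toy_ctype = {"A": _MY_U, "a": _MY_L, "7": _MY_NMR, "-": 0}


def is_word_char(ch, table):
    # Characters missing from the toy table are treated as having no bits set.
    return true_word_char(table.get(ch, 0), ch)


# Marking '-' as a "letter" mimics the documented edit to the ctype array.
patched_ctype = {**toy_ctype, "-": _MY_L}

for ch in "Aa7-_":
    print(ch, is_word_char(ch, toy_ctype), is_word_char(ch, patched_ctype))
```

With the unmodified table the hyphen fails the test, while the underscore passes through the explicit comparison in the macro; once the hyphen carries a letter bit it passes as well.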

Finally, there is a list of stopwords, which are ignored (InnoDB has a short list, while MyISAM has quite a long one). For example: about, are, com, etc.

Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list.
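A minimal sketch of stopword filtering follows. The set below is only a small illustrative subset built from the examples mentioned above, not either engine's actual list, and lowercasing before comparison is an assumption of this sketch (in MySQL the comparison depends on the collation in use).

```python
# Illustrative subset only; the real built-in lists are longer.
STOPWORDS_SUBSET = {"about", "are", "com", "the"}


def drop_stopwords(words, stopwords=STOPWORDS_SUBSET):
    """Remove words that appear in the given stopword set."""
    # Assumption: case-insensitive matching via lowercasing.
    return [w for w in words if w.lower() not in stopwords]


print(drop_stopwords(["The", "words", "are", "about", "MySQL"]))
```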

Related Solutions

Determine Oracle session client character set

I am a little doubtful that this is exactly what you are looking for, but

host echo %nls_lang%;

ENGLISH_UNITED KINGDOM.WE8ISO8859P1

shows the client nls_lang environment variable on the client.
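For reference, the value follows the LANGUAGE_TERRITORY.CHARSET layout, so it can be read and split on the client side. The sketch below is an assumption-laden illustration: parse_nls_lang is a hypothetical helper, and it simply returns empty strings for any component that is missing.

```python
import os


def parse_nls_lang(value):
    """Split an NLS_LANG value into (language, territory, charset)."""
    head, _, charset = value.partition(".")
    language, _, territory = head.partition("_")
    return language, territory, charset


# The value shown above, used as a fallback if the variable is not set.
example = os.environ.get("NLS_LANG", "ENGLISH_UNITED KINGDOM.WE8ISO8859P1")
print(parse_nls_lang(example))
```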

I don't think there will be a SQL query you can run to give the 'current' setting because AFAIK the server is not aware of what translation is done client-side, so any command to show the current setting will have to be native to the client. I used SQL Developer for the above command, but I assume it will work the same in SQL*Plus.

--edit

from AskTom:

only the client knows their character set as well -- it is not available "in the database"

and

the character set describes what is stored in database.

the client makes their desired translated to character know [sic] to the database via the NLS_LANG settting.

If you were on 11.1+, you might have some joy with v$session_connect_info, because:

This information is pushed by OCI to the server at login time.

But I discovered it would still depend on how you are connecting, e.g. from the JDBC Thin Driver you aren't using OCI and so the information isn't pushed.

Mysql – When are we supposed to set default-character-set for the client

You have the option of setting character sets/collations on the fly:

While trial-and-error may be necessary, don't go willy-nilly on these variables in /etc/my.cnf. You are better off setting them dynamically during any trial-and-error testing.

To make sure of any corner cases, look at the initial variables of any mysqldump and see if character sets or collations are set in the beginning and reset at the bottom.

In fact, here is a sample for a mysqldump:

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

Note the variables

@OLD_CHARACTER_SET_CLIENT
@OLD_CHARACTER_SET_RESULTS
@OLD_COLLATION_CONNECTION

and note that SET NAMES is hardwired to utf8.

You could perhaps set these options in one of the following ways:

when calling the mysqldump
place these options in /etc/my.cnf under [mysqldump] group section
edit these values for existing mysqldumps using perl, awk, etc.
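As one possible take on the last option, the sketch below rewrites the SET NAMES line of an existing dump, using Python in place of perl or awk. The function name and the target character set are hypothetical, and the pattern assumes the line looks exactly like the sample above. Changing this line only changes how the server interprets the file; it does not convert the bytes, so it is only appropriate if the dump really is encoded in the character set you name.

```python
import re

# Matches the versioned comment line from the sample dump header.
SET_NAMES = re.compile(r"(/\*!40101 SET NAMES )(\w+)( \*/;)")


def rewrite_set_names(dump_text, charset):
    """Return dump_text with the SET NAMES character set replaced."""
    return SET_NAMES.sub(lambda m: m.group(1) + charset + m.group(3), dump_text)


header = (
    "/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;\n"
    "/*!40101 SET NAMES utf8 */;\n"
)
print(rewrite_set_names(header, "latin1"))
```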
