MySQL – (non-binary) MySQL collation that doesn’t treat different mathematical symbols as the same character

Tags: collation, mysql-5.5, unicode

I've run into a real headache with MySQL's collations and non-BMP characters (ones with Unicode codepoints above U+FFFF).

Basically, given a table and data like:

CREATE TABLE `math` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `symbols` varchar(32) character set utf8mb4 not null,
  PRIMARY KEY (`id`),
  UNIQUE KEY `symbols` (`symbols`)
);
INSERT INTO `math` VALUES (1,'𝕂');

(You may not have a font to display the character in the string literal above. It's U+1D542 MATHEMATICAL DOUBLE-STRUCK CAPITAL K)

things look OK:

mysql> select * from math;
+----+---------+
| id | symbols |
+----+---------+
|  1 | 𝕂       |
+----+---------+
1 row in set (0.00 sec)

mysql> select * from math where symbols = '𝕂';
+----+---------+
| id | symbols |
+----+---------+
|  1 | 𝕂       |
+----+---------+
1 row in set (0.00 sec)

So far so good. But then there's this crap:

mysql> select * from math where symbols = '𝕃';
+----+---------+
| id | symbols |
+----+---------+
|  1 | 𝕂       |
+----+---------+
1 row in set (0.00 sec)

and

mysql> INSERT INTO `math` VALUES (2,'?');
ERROR 1062 (23000): Duplicate entry '?' for key 'symbols'

(The string literal above has U+1D543 MATHEMATICAL DOUBLE-STRUCK CAPITAL L. Note that MySQL's error message shows a ?, but the U+1D542 in the results of the SELECT above does display correctly for me, so there don't seem to be any encoding issues with I/O to the server.)
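
One way to double-check that the stored bytes are intact, independent of fonts and terminal encoding, is to look at the raw hex. This is only a sketch against the `math` table above; the values in the comments assume the row really holds U+1D542 encoded as utf8mb4.

```sql
-- Inspect the stored value without relying on the terminal's encoding.
SELECT id,
       HEX(symbols)         AS utf8mb4_bytes,  -- F09D9582 for U+1D542
       CHAR_LENGTH(symbols) AS characters,     -- 1
       LENGTH(symbols)      AS bytes           -- 4
FROM math;
```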

(Code above updated; it originally had 1 for the primary key, which fails for obvious reasons.)


So, MySQL thinks these two characters are the same? I know it case-folds, but this isn't a matter of casing.

Needless to say, I didn't even realize I had this problem till it came up in production, because the real-world data involved rarely differs on only these characters. However, this is totally unacceptable collation behavior.

Switching to the binary collation does fix it. However, I'm using Django to access the database, and with a binary collation it gives me bytes instead of characters (I can decode them myself, but that's a big pain).

I'm guessing the issue has something to do with these characters being outside the BMP, but it's still surprisingly bad behavior.

Is there a way to get MySQL to use sensible collation, short of writing and installing one myself?

I suspect the non-BMP characters are the problem, since I also tried:

⛇ (U+26C7 BLACK SNOWMAN) (a BMP character) works fine.
🌀 (U+1F300 CYCLONE) (an SMP character) gives the same error as above
🂡 (U+1F0A1 PLAYING CARD ACE OF SPADES) (SMP) same error
𠀃 (U+20003 CJK UNIFIED IDEOGRAPH-20003) (SIP) same error

(I can't link these to codepoints.net, since I don't have enough reputation. Should be fairly obvious what their URLs are though.)

I'm using MySQL 5.5.40 on Ubuntu 14.04.

Best Answer

From 10.1.14.1 Unicode Character Sets in the MySQL 5.5 Reference Manual (emphasis added):

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA collations, their collating weight is 0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

and:

The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters.

So the answer to your question appears to be "no".
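
The rule quoted above can be checked directly. The following is a minimal sketch, assuming the three utf8mb4 collations named below are installed (they ship with 5.5). The hex literals are the utf8mb4 encodings of U+1D542 and U+1D543, written that way so the result does not depend on the connection character set.

```sql
-- Compare U+1D542 with U+1D543 under three collations.
-- COLLATE binds more tightly than "=", so it applies to the right-hand literal
-- and, being explicit, decides the collation used for the comparison.
SELECT
  _utf8mb4 X'F09D9582' = _utf8mb4 X'F09D9583' COLLATE utf8mb4_general_ci AS general_ci,
  _utf8mb4 X'F09D9582' = _utf8mb4 X'F09D9583' COLLATE utf8mb4_unicode_ci AS unicode_ci,
  _utf8mb4 X'F09D9582' = _utf8mb4 X'F09D9583' COLLATE utf8mb4_bin        AS bin;
```

If the documentation holds, the first two columns should come back as 1 (equal) on 5.5, and the binary collation, which compares code points, should return 0.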

Finally:

If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

Though it is not clear to me how this should be applied for comparisons.
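
One possible way to carry the same idea over to comparisons is to keep the ordinary predicate and add a second one with an explicit binary collation. This is a sketch only, using the `math` table from the question and the utf8mb4 bytes of U+1D543 as a hypothetical search value; I have not measured how the optimizer treats it.

```sql
-- Look up U+1D543 exactly, even though the column collation
-- treats all supplementary characters as equal.
SELECT *
FROM math
WHERE symbols = _utf8mb4 X'F09D9583'                       -- column collation; can use the index
  AND symbols COLLATE utf8mb4_bin = _utf8mb4 X'F09D9583';  -- exact code point match
```

This only helps reads. A UNIQUE key still compares with the column's own collation, so the duplicate-entry error on INSERT stays unless the column itself is given a different collation.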

The documentation extracts quoted above are unchanged for MySQL 5.7.

QUESTION #1

Why are there different levels of MySQL collation/charsets?

ANSWER TO QUESTION #1

There are two good reasons for different character sets and collations

Reason #1 : Disk Space

When you run this query

SELECT
    maxlen,
    GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    COUNT(1) CharSetCount
FROM information_schema.character_sets
GROUP BY maxlen\G

You get this:

mysql> SELECT
    ->     maxlen,
    ->     GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    ->     COUNT(1) CharSetCount
    -> FROM information_schema.character_sets
    -> GROUP BY maxlen\G
*************************** 1. row ***************************
      maxlen: 1
    CharSets: cp1257,cp850,binary,koi8r,latin2,ascii,tis620,koi8u,greek,armscii8,keybcs2,macroman,latin7,cp1251,cp1256,dec8,hp8,geostd8,latin1,swe7,hebrew,cp1250,latin5,cp866,macce,cp852
CharSetCount: 26
*************************** 2. row ***************************
      maxlen: 2
    CharSets: big5,cp932,sjis,gbk,ucs2,euckr,gb2312
CharSetCount: 7
*************************** 3. row ***************************
      maxlen: 3
    CharSets: eucjpms,ujis,utf8
CharSetCount: 3
*************************** 4. row ***************************
      maxlen: 4
    CharSets: utf16,utf32,utf8mb4
CharSetCount: 3
4 rows in set (0.00 sec)

mysql>

Some character sets need at most 1 byte to represent a character. Others need more. Given this information, you may want to avoid the eucjpms, ujis, utf8, utf16, utf32, and utf8mb4 character sets so that VARCHAR and TEXT data takes less space on disk.
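
To see what the byte counts look like for a concrete character, here is a small sketch. It takes U+00E9 (é) as its single latin1 byte and converts it to other character sets. Note that utf8 and utf8mb4 are variable-width, so maxlen is an upper bound: plain ASCII still takes 1 byte per character in them.

```sql
-- Bytes needed for the same character in different character sets.
SELECT
  LENGTH(_latin1 X'E9')                        AS latin1_bytes,      -- 1
  LENGTH(CONVERT(_latin1 X'E9' USING utf8))    AS utf8_bytes,        -- 2
  LENGTH(CONVERT(_latin1 X'E9' USING utf16))   AS utf16_bytes,       -- 2
  LENGTH(CONVERT(_latin1 X'E9' USING utf32))   AS utf32_bytes,       -- 4
  LENGTH(CONVERT(_latin1 X'41' USING utf8mb4)) AS ascii_in_utf8mb4;  -- 1 (the letter A)
```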

Reason #2 : Internationalization

Each character set comes with one or more collations to cover a variety of languages.

When you run this query

SELECT
    A.CHARACTER_SET_NAME,
    GROUP_CONCAT(COLLATION_NAME) Collations,
    COUNT(1) CollationCount
FROM
    information_schema.character_sets A
    INNER JOIN information_schema.collations B
    USING (CHARACTER_SET_NAME)
GROUP BY A.CHARACTER_SET_NAME\G

You will see that some character sets have multiple collations for different parts of Europe. Collations for Chinese, Japanese, Greek, and parts of Asia Minor and Scandinavia are also available.
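
To look at the collations of one character set instead of all of them, the same information_schema table can be filtered. A minimal sketch, with utf8mb4 picked only as an example:

```sql
-- List the collations available for a single character set.
SELECT COLLATION_NAME, IS_DEFAULT
FROM information_schema.collations
WHERE CHARACTER_SET_NAME = 'utf8mb4'
ORDER BY COLLATION_NAME;
```

Names such as utf8mb4_swedish_ci or utf8mb4_spanish_ci in the result are the language-specific collations.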

QUESTION #2

Should you always ensure your PHP connection matches the charset of the database you're working on?

ANSWER TO QUESTION #2

SCENARIO

You are driving at 3:00 AM. You are the only driver on the road. You come to an intersection. You have the red light.

Question : Do you stop or go through the red light?

Answer : Depends on the neighborhood

Safe neighborhood ?
- Some abide by the law, stop at the red, and wait for green.
- Some chance it and go through
Bad neighborhood or new to the area ?
- Some abide by the law, stop at the red, and wait for green AT THE RISK OF A CARJACKING
- Some chance it and go through to AVOID OR REDUCE RISK OF A CARJACKING
- Assume the worst and find another route

How does this apply?

You should err on the side of caution. Always check the charset beforehand, because you do not know the neighborhood (client program, web browser) the PHP connection will be entering, or whether there is a risk of a carjacking (putting invalid data into the database, requesting too much data for retrieval).
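
Checking beforehand could look something like this. It is a sketch only: `math` is the example table from the first question, standing in for whatever table you are about to use.

```sql
-- What the current connection is using.
SELECT @@character_set_client,
       @@character_set_connection,
       @@character_set_results,
       @@collation_connection;

-- What the target columns are declared as.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.columns
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'math'
  AND CHARACTER_SET_NAME IS NOT NULL;  -- skip non-string columns
```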

QUESTION #3

If you can have different tables that use different character sets do you just use SET NAMES or mysql(i)_set_charset to switch?

ANSWER TO QUESTION #3

By all means
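
As a rough sketch of what switching looks like in plain SQL (the two character sets are only examples): SET NAMES sets character_set_client, character_set_connection, and character_set_results in one statement.

```sql
-- Switch the connection to the character set of the tables about to be used.
SET NAMES utf8mb4;
-- ... queries against utf8mb4 tables ...

SET NAMES latin1;
-- ... queries against latin1 tables ...
```

From PHP, the PHP manual recommends mysqli_set_charset over issuing SET NAMES yourself, because the client library then also knows the character set when escaping strings.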

QUESTION #4

If you have a table that has multiple charsets how do you manage that since the connection can only use one charset at a time?

ANSWER TO QUESTION #4

You may have to shift character sets with the DB Session. Here are the settings that can be changed at the session level:

Please set these carefully before reading from and writing to the database. It would also be wise to store the character set name and collation in the same table you will be accessing.
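
The session-level character set and collation variables can be listed and changed along these lines. This is a sketch, with utf8mb4 and utf8mb4_unicode_ci chosen only as example values.

```sql
-- Show the session-level character set and collation settings.
SHOW SESSION VARIABLES LIKE 'character\_set\_%';
SHOW SESSION VARIABLES LIKE 'collation\_%';

-- Change the connection-related ones for the current session only.
SET SESSION character_set_client     = utf8mb4;
SET SESSION character_set_connection = utf8mb4;
SET SESSION character_set_results    = utf8mb4;
SET SESSION collation_connection     = utf8mb4_unicode_ci;
```

Keep in mind that the server converts column data to character_set_results when it sends results back. A connection character set that can represent everything stored (utf8mb4, for instance) may therefore be enough to read columns declared with different character sets, without switching per column.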