MySQL – Differences Between utf8mb4 Binary Collations

collationMySQLutf-8

What is the difference between utf8mb4_0900_bin vs utf8mb4_bin binary collations?

Best Answer

There are three differences as far as I can tell (according to their documentation):

  1. Case-mappings (for LOWER() / UPPER() functions):

    https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca

    The LOWER() and UPPER() functions perform case folding according to the collation of their argument.

    The difference between the two collations in this context is that the _0900_ version, being based on a newer version of Unicode, quite likely has more mapping definitions (and possibly even some corrections).

  2. Padding vs No Padding (of trailing spaces):

    https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes

    The pad attribute for utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations

    Essentially, utf8mb4_bin ignores trailing spaces while utf8mb4_0900_bin does not ignore them. See the documentation (linked above) for an example.

  3. Sorting (performance only, not the ordering):

    https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights

    • For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.

    • For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.

    Translating that into human, they are saying that for a code point such as U+FF9D, utf8mb4_bin will see the UTF-8 encoded byte sequence of EF BE 9D and convert that into 00 FF 9D. But, utf8mb4_0900_bin will not convert it into the code point value. This is due to the UTF-8 byte sequence already being sequential, hence the ordering is the same as it for the code point values. So why bother with that extra conversion step?