Mysql – Does MySQL 8 ASCII vs utf8mb4_0900_ai_ci size differ when only using ASCII characters

character-setcollationencodingMySQLmysql-8.0

If I only use only ASCII characters, will VARCHAR (255) with utf8mb4_0900_ai_ci be larger on disk than VARCHAR (255) using ASCII?

Best Answer

The Fiddle is wrong.

あ A い I う U え E お O.

is 20 characters / 40 bytes when declaring that the client is encoded in utf8 (or utf8mb4). But if you claim that that it is in latin1, it leads to Mojibake or "double-encoding", hence the 30 and 48 that Fiddle shows.

あ A い I う U え E お O.  --> E38182 41 E38184 49 E38186 55 E38188 45 E3818A 4F 2E

For further discussion of what went wrong, see "double encoding" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored . I don't have the source code to "fix" Fiddle.

That is, E38182 is the 3 hex bytes for the HIRAGANA LETTER A:

But, if you treat E38182 (etc) as latin1, it shows as ゠A ㄠI ㆠU ㈠E ㊠O.. Then if you convert again to utf8, you get

C3A3 C281 E2809A 20 41 20 C3A3 C281 E2809E 20 49 20 ...

You can still recognize the spaces (20), A (41), I (49), etc, but the Hiragana characters have been mangled.

You don't see the double-encoding in Fiddle because the browser is 'kind enough' to 'fix' your mistake. (This makes figuring out what went wrong quite devilish.)

The Chinese hex is E683B3 E79C8B E4BB80 E9A0AD E6B885 E58FAA E582B7 E7B2BE EFBC8C E4B8AD E7BE8E E8A780 E79A84 E68EA5 E5A794 E4B8BB E58091 E8AA8D E58FAF E69893 E795AB E7AD89 E58AA9 E6B5B7 E59BA0 09

(The tab (09) at the end may be an artifict of the formatting.)

The double encoding starts with C3A6 C692 C2B3 (from EF, BC, 8C) C3A7 C593 E280B9 C3A4 C2BB E282AC C3A9 C2A0 C2AD C3A6 C2B8 E280A6

Back to the Title Question -- There are minor subtle differences, even when all you use is ascii.

You will probably not encounter any measurable difference. Here are som possibilities.

  • Certain temp table actions may hit limits sooner. (This problem existed in 5.7, but may have been more than eliminated in 8.0 by now turning VARCHAR into CHAR when building temp tables.)

  • I have yet to see a benchmark that shows that utf8mb4 collations of ascii text are or are not as fast as CHARACTER SET latin1 or ascii.

  • Index limits are shorter for CHARSET utf8mb4 than for CHARSET ascii. They are probably VARCHAR(3072) versus VARCHAR(768).