Mysql – How large space will be occupied by thesql for a varchar utf8 column

character-setMySQLutf-8varchar

Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.

From:http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html

Does this mean if I use a varchar utf8 column other than char utf8 column in mysql, mysql will physically store characters using 1 byte for latin and 3 bytes for CJK(Chinese、japanese、Korean) ?

Best Answer

CHAR(...) CHARACTER SET utf8 always takes 3 bytes per character -- CHAR(100) occupies 300 bytes (no length needed).

VARCHAR occupies 1-2 bytes for a length, plus only as many bytes as needed. So VARCHAR(100) with hello will occupy 7 (2+5) bytes in any character set.

Señor, in CHARACTER SET latin1, take 5 bytes (plus length). In utf8, it takes 6 bytes (plus length). This is because ñ is the 1-byte hex F1 in latin1 or the 2-byte C3B1 for utf8.

Some Chinese characters and some Emoji, need 4 bytes, so utf8mb4 is a better choice for them.

There is very little use for ever using CHAR instead of VARCHAR. They are usually for consistent-length fields such as postal code, country code, md5, uuid, etc. And all of those may as well be CHARACTER SET ascii or possibly latin1, but not utf8.

Back to your question... With CHARACTER SET utf8 (or utf8mb4):

  • English letters take 1 byte.
  • Western European accented letters take 2 bytes.
  • Greek, Hebrew, and other languages in that general area, take 2 bytes.
  • Japanese and Korean take 3 bytes.
  • Most Chinese characters are 3 bytes; some are 4 and need utf8mb4.
  • A longer list

"mysql will physically store characters using 1 byte for latin and 3 bytes for CJK" is not phrased correctly. Change 'latin' to 'English' to make it correct. ('Latin' tends to include various accented characters.)