Mysql – Does latin1 have performance benefits over utf8

MySQLmysql-5.7

I have a table in utf8 with > 80M records and one of the columns (char(6) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL) can contain just latin symbols ([a-zA-Z0-9]). Does it have the sense to convert this column into latin1?

MySQL doc says:

To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible character length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.

I made a test – created 2 tables with the same 50M records:

CREATE TABLE `t_utf8` (
  `c_1` char(6) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `index_t_utf8_on_c_1` (`c_1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;

CREATE TABLE `t_lat` (
  `c_1` char(6) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  UNIQUE KEY `index_t_lat_on_c_1` (`c_1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;

but MySQL says that they have almost the same size:

           Name: t_lat
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 57557700
 Avg_row_length: 30
    Data_length: 1741668352
Max_data_length: 0
   Index_length: 0
      Data_free: 2097152
 Auto_increment: NULL
      Collation: utf8_general_ci
 Create_options: row_format=COMPACT

           Name: t_utf8
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 57554528
 Avg_row_length: 31
    Data_length: 1810874368
Max_data_length: 0
   Index_length: 0
      Data_free: 3145728
 Auto_increment: NULL
      Collation: utf8_general_ci
 Create_options: row_format=COMPACT

Why does it so?

  • MySQL 5.7
  • InnoDB

P.S: I made the same test with MyISAM and got expected benefit: table with latin1 – 383Mb, utf8 – 1Gb. But why it does not work for InnoDB?

Best Answer

Short answer: Don't bother.

Long answer:

  • https://bugs.mysql.com/bug.php?id=84440 -- you may or may not save space.
  • InnoDB is inherently bigger than MyISAM.
  • Comparing characters in utf8 is slightly slower than in latin1. But you will probably not notice.
  • The tiny difference between 1741668352 abd 1810874368 is probably due to the random nature of how you build one table from the other. Or the phase of the moon, not something significant.
  • The 30 vs 31 comes from how InnoDB estimates things.

Do not use CHAR except for truly fixed-length strings. Almost always they are ascii, such as country_code, postal_code, UUID, hex, md5, etc.

Get in the habit of explicit saying ascii or utf8mb4 when you create the column/table unless you have an unusual case where you need something else.