MySQL – Server collation utf8_unicode_ci vs table collation utf8_bin: compatibility and performance

collationlinuxmysql-5.5performanceutf-8

I'm not sure whether this belongs to SE DBA, but it sounds more logical …

Our MySQL 5.5.32 on Ubuntu 12.04 64 uses utf8_unicode_ci for the Server collation and some tables, like phpBB3, uses utf8_bin for all of their tables. This has been since start.

I read utf8_bin is (at least in theory) faster since no conversion and/or normalization is done, but are these quite different UTF8 thingies fully compatible ? If they are, is there performance hit or improvement ? The tables I'm planning switching from utf8_unicode_ci to utf8_bin contain integers and dates, so not having to bother with much larger UTF8 should drop at least 1ms when dealing with these large ASCII-only tables.

Before I had all tables not already using UTF8 Unicode Ci or UTF8 Binary I was using ASCII Binary for these tables, but decided to switch to UTF8 Unicode Ci for full compatiblity with Server collation (and of course a bit of performance improvement).

Best Answer

By using "UTF8 thingies" I think that the problem here would be concepts. Let's review them:

Encoding (badly called character set in MySQL): utf-8 (utf8 and utf8mb4), utf-16 (utf16), 7-bit ASCII (ascii)... Those are the actual formats in which the characters are stored, transmitted or converted. They can use (in MySQL, from 1 byte to 4, and some are dynamic, while others are fixed in size)
Collation: these are the default algorithms that, when nothing is explicitly said, MySQL will use to compare 2 strings. the _bin one basically compare bit-to-bit, while the _ci and _cs will compare for a particular language, in a case-insensitive or case-sensitive way.

Now we have presented the concepts, let's apply them: First, you should use the most appropriate format for your application. If your application needs to store international names, you definitely want to use a multi-byte encoding. And given that even the most local business are globalised something like utf8 is practically a must. However, there are some cases in which has a lot of sense to use different encodings. Storing something like a car plate, may only require ascii, which not only is cheaper in size, but would prevent from other strange characters to get in. Other examples include Wikipedia, a high-traffic site that has to support perfectly tons of different languages. They decided (I suppose partly to overcome MySQL limitation, partly for efficiency) to store all the pages without encoding, in pure blobs.

Regarding efficiency: having the same encoding in all the tiers of your application is usually the best policy: you avoid unnecessary conversions back and forth. Smaller and fixed size encodings are supposed to be faster but never compromise functionality. Sometimes the bottleneck on encodings can be a slow conversion, sometimes on storing many zeros. Compression can always help with the last one, specially with text.

And now about collation: basically use the one you will need by default- it will affect the string comparisons and order of the sorts. Different collations do not change the bits for each row, but it (may) change the order between them (is 'a' before 'A' or are they equivalent?). Although you can force a different collation for particular queries manually:

mysql> SELECT mycolumn FROM mytable WHERE mycolumn LIKE 'a%' COLLATE utf8_bin;
mysql> SELECT mycolumn FROM mytable ORDER BY mycolumn COLLATE utf8_bin;

Why you may have seem many people saying that the _bin version may be faster? Well, because it only has to compare bits, and not have into account the particularities of each language. If you are not going to ever do comparisons in which a == A and even a == A == á == ä, use the binary collation. But in those cases, you probably do faster by not storing text in char/varchar/text, but in the binary equivalents: varbinary/binary/blob. Of course, if you want to get the advantages of storing characters and not bytes, like getting those comparisons done automatically done for you, use utf8_general_ci or utf8_unicode_ci, which will work for most languages well.

Given that most of your data is ASCII, the size in utf8 shouldn't have changed much. Check your performance- I have suffered some weird utf8 bugs, but usually this shouldn't be a common problem unless having to deal with lots of 4-byte utf-8 characters. Given that most of your data is ascii-like, and that cpu is usually not the bottleneck (and IO will probably won't change), you may not see any change, so use the combination that better fits your needs. But test it before and after to see the change.

Remember that there are several configurations on mysql: the character set for the connections, the one for the table storage and the one in which they are sent back to the client. You can check both the encoding and the collations of a particular session by doing:

mysql> show session variables like '%character\_set\_%';
mysql> show session variables like '%collation\_%';

Some of them are not that important, like the internal format of the identifiers.

Finally, remember that the configuration of the charset and collation for databases and tables are defaults, and that changing those don't modify the actual column encoding. Also, be careful when ALTERing the tables, as you block writes to that table.

Related Solutions

Mysql – What’s the differences between utf8_general_ci and utf8_unicode_ci and utf8_binary collation in MySQL

According to the MySQL Documentation

Collations have these general characteristics:

Two different character sets cannot have the same collation.

Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci. The output for SHOW CHARACTER SET indicates which collation is the default for each displayed character set.

There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).

In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

StackOverflow has a list of questions tagged utf-8 and collation

ServerFault only has one tagged utf-8 and collation

There is a website called efreedom.com that has links all around StackOverflow concerning utf8 : http://efreedom.com/Question/1-4784168/Change-Collation-Utf8-Bin-One-Go

Here is another site about collations as its place in the MySQL World : http://www.collation-charts.org/

Here is a link explaining binary collations : http://dev.mysql.com/doc/refman/5.0/en/charset-binary-collations.html

MySQL Character set and Collation Issue.

Based on your default database encoding, character_set_database=latin1, you should force your client to use latin1. Your client is currently in utf8 and the database appears to be in latin1. This assumes that your table is also using latin1.

To make sure your data is stored correctly in MySQL, you can force the mysql client to use latin1.

1 Force the Client to use latin1

mysql -u root -p --default-character-set=latin1 nameofthedatabase

Then run your SELECT statement to verify the character formatting

select budgetID,StartDate,modifiedBy from Table_name order by budgetID desc limit 10

2 Backup the database using latin1

You can also dump the database using the same option

mysqldump -u root -p --default-character-set=latin1 nameofthedatabase > nameofthedatabase.sql

Then you could view the dump file nameofthedatabase.sql using a standard text editor like textpad or gedit to see if the characters are encoded correctly.

Collation is a set of rules that determine how rows are sorted and how they are compared. Collation does not affect the encoding which is the problem you are currently having.

If both options still give you weird symbols, the import file may not have been correctly formatted in the first place.

Best Answer

Related Solutions

Mysql – What’s the differences between utf8_general_ci and utf8_unicode_ci and utf8_binary collation in MySQL

MySQL Character set and Collation Issue.

Related Question