SQL Server – Multiple charsets and collations for a multinational database

collation sql-server-2008-r2

After decades of doing business with companies that adhere to the general Latin1 collation, my company now faces the issue of storing information in a different charset and collation: Greek.
So, it's time to start thinking about some sort of redesign of our databases.

Given that my installation is MS SQL Server 2008 R2, what are the best methods, or generally accepted guidelines, for doing such a thing?
Multiple tables? Multiple databases with different settings?

I'm not a DBA; I'm only asking for a starting point to mull over.

Thank you very much for reading and to all who will care to reply.

Best Answer

Quote from this MS tech-page:

If the users of your instance of SQL Server speak multiple languages, you should pick a collation that best supports the requirements of the various languages. For example, if the users generally speak western European languages, choose the Latin1_General collation. When you support users who speak multiple languages, it is most important to use the Unicode data types, nchar, nvarchar, and ntext, for all character data. Unicode was designed to eliminate the code page conversion difficulties of the non-Unicode char, varchar, and text data types. Collation still makes a difference when you implement all columns using Unicode data types because it defines the sort order for comparisons and sorts of Unicode characters. Even when you store your character data using Unicode data types you should pick a collation that supports most of the users in case a column or variable is implemented using the non-Unicode data types.

So, just like @Gonsalu said in a comment to @TechiGurl: build your database for Unicode to support multiple languages.

In practice this means using nchar/nvarchar/ntext datatypes, and not char/varchar/text.
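
As a minimal T-SQL sketch of what that looks like, assuming a hypothetical dbo.Customer table (the table and column names are illustrative, not from the question). It uses nvarchar(max) instead of ntext for long text, since ntext has been deprecated since SQL Server 2005:

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE dbo.Customer (
    CustomerId  int IDENTITY(1,1) PRIMARY KEY,
    CompanyName nvarchar(200) NOT NULL,  -- Unicode: holds Latin and Greek text alike
    Notes       nvarchar(max) NULL       -- used here in place of the deprecated ntext
);

-- The N prefix marks a Unicode literal. Without it the string goes through
-- the database's code page first, and Greek letters may turn into '?'.
INSERT INTO dbo.Customer (CompanyName) VALUES (N'Ελληνική Εταιρεία Α.Ε.');
INSERT INTO dbo.Customer (CompanyName) VALUES (N'Acme Ltd');

SELECT CustomerId, CompanyName
FROM dbo.Customer;
```

Both rows can live in the same table and the same database; no per-language split is needed just to store the data.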

Choosing a collation is then a matter of query optimization. Any collation will order your text in a well-defined way; what differs between collations is the rule set. Collations specify how strings of character data are sorted and compared, based on the norms of particular languages and locales. Thus you should choose the collation that best serves the language requirements of most of the queries run on your database.

In the case given by the OP, if Greek text makes up 10% of all data, then I would stay with the Latin1 collation. If there is more Greek-language data, or if most of the queries run on the database retrieve Greek data, then I would go with a Greek_ collation.
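
The choice does not have to be made once for the whole database: SQL Server also accepts a collation per column and per expression. A hedged sketch, assuming a hypothetical dbo.Product table and the Greek_CI_AS and Latin1_General_CI_AS collations (check which ones your instance offers with the last query):

```sql
-- Hypothetical table: the database default can stay Latin1_General,
-- while one column carries a Greek collation.
CREATE TABLE dbo.Product (
    ProductId int IDENTITY(1,1) PRIMARY KEY,
    NameLatin nvarchar(200) COLLATE Latin1_General_CI_AS NOT NULL,
    NameGreek nvarchar(200) COLLATE Greek_CI_AS NULL
);

-- A collation can also be applied per query, leaving the schema untouched.
SELECT ProductId, NameLatin
FROM dbo.Product
ORDER BY NameLatin COLLATE Greek_CI_AS;

-- List the Greek collations available on this instance.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE N'Greek%';
```

Comparing or joining two columns with different collations requires an explicit COLLATE on one side, so mixing collations inside one schema is best kept to the columns that really need it.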

Here is a list of default MS Collations.

QUESTION #1

Why are there different levels of MySQL collation/charsets?

ANSWER TO QUESTION #1

There are two good reasons for having different character sets and collations:

Reason #1 : Disk Space

When you run this query

SELECT
    maxlen,
    GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    COUNT(1) CharSetCount
FROM information_schema.character_sets
GROUP BY maxlen\G

You get this:

mysql> SELECT
    ->     maxlen,
    ->     GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    ->     COUNT(1) CharSetCount
    -> FROM information_schema.character_sets
    -> GROUP BY maxlen\G
*************************** 1. row ***************************
      maxlen: 1
    CharSets: cp1257,cp850,binary,koi8r,latin2,ascii,tis620,koi8u,greek,armscii8,keybcs2,macroman,latin7,cp1251,cp1256,dec8,hp8,geostd8,latin1,swe7,hebrew,cp1250,latin5,cp866,macce,cp852
CharSetCount: 26
*************************** 2. row ***************************
      maxlen: 2
    CharSets: big5,cp932,sjis,gbk,ucs2,euckr,gb2312
CharSetCount: 7
*************************** 3. row ***************************
      maxlen: 3
    CharSets: eucjpms,ujis,utf8
CharSetCount: 3
*************************** 4. row ***************************
      maxlen: 4
    CharSets: utf16,utf32,utf8mb4
CharSetCount: 3
4 rows in set (0.00 sec)

mysql>

Some character sets need at most 1 byte to represent a character. Others need more. Given this information, you may want to refrain from using the eucjpms, ujis, utf8, utf16, utf32, and utf8mb4 character sets so that VARCHAR and TEXT data takes less space on disk.
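
To see the size difference on a concrete value, here is a small sketch that compares the byte length of the three Greek letters αβγ in utf8mb4 and in the single-byte greek character set. The hex literal is the UTF-8 encoding of αβγ, written in hex so the result does not depend on the client's terminal encoding:

```sql
-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- _utf8mb4 is a character set introducer; X'CEB1CEB2CEB3' is 'αβγ' in UTF-8.
SELECT
    CHAR_LENGTH(_utf8mb4 X'CEB1CEB2CEB3')                 AS char_count,    -- 3 characters
    LENGTH(_utf8mb4 X'CEB1CEB2CEB3')                      AS bytes_utf8mb4, -- 2 bytes per Greek letter
    LENGTH(CONVERT(_utf8mb4 X'CEB1CEB2CEB3' USING greek)) AS bytes_greek;   -- 1 byte per Greek letter
```

The saving only applies to text that the single-byte character set can represent; anything outside its repertoire is lost on conversion.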

Reason #2 : Internationalization

Each character set comes with one or more collations to cover a variety of languages.

When you run this query

SELECT
    A.CHARACTER_SET_NAME,
    GROUP_CONCAT(COLLATION_NAME) Collations,
    COUNT(1) CollationCount
FROM
    information_schema.character_sets A
    INNER JOIN information_schema.collations B
    USING (CHARACTER_SET_NAME)
GROUP BY A.CHARACTER_SET_NAME\G

You will see that some character sets have multiple collations for different parts of Europe. Chinese, Japanese, Greek, and parts of Asia Minor and Scandinavia are also covered.

QUESTION #2

Should you always ensure your PHP connection matches the charset of the database you're working on?

ANSWER TO QUESTION #2

SCENARIO

You are driving at 3:00 AM. You are the only driver on the road. You come to an intersection. You have the red light.

Question : Do you stop or go through the red light?

Answer : Depends on the neighborhood

Safe neighborhood ?
- Some abide by the law, stop at the red, and wait for green.
- Some chance it and go through
Bad neighborhood or new to the area ?
- Some abide by the law, stop at the red, and wait for green AT THE RISK OF A CARJACKING
- Some chance it and go through to AVOID OR REDUCE RISK OF A CARJACKING
- Assume the worst and find another route

How does this apply?

You should err on the side of caution. You should always check the charset beforehand because you do not know the neighborhood (client program, internet browser) the PHP connection will be entering and if there is a risk of a carjacking (putting invalid data into the database, requesting too much data for retrieval).

QUESTION #3

If you can have different tables that use different character sets do you just use SET NAMES or mysql(i)_set_charset to switch?

ANSWER TO QUESTION #3

By all means
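
A minimal sketch of switching with SET NAMES inside one session. The table names t_greek and t_latin are hypothetical. Note that the server converts between the connection character set and each column's character set, so a switch is only needed when the client itself changes the encoding it sends and expects:

```sql
-- Client is about to send and receive ISO 8859-7 (greek) bytes.
SET NAMES greek;
SELECT * FROM t_greek;   -- hypothetical table stored as greek

-- Client switches to UTF-8 for the next statements.
SET NAMES utf8mb4;
SELECT * FROM t_latin;   -- hypothetical table stored as latin1; results arrive as UTF-8
```

From PHP, the PHP manual recommends mysqli_set_charset() over issuing SET NAMES as a query, because it also informs the client library, which matters for mysqli_real_escape_string().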

QUESTION #4

If you have a table that has multiple charsets how do you manage that since the connection can only use one charset at a time?

ANSWER TO QUESTION #4

You may have to shift character sets with the DB Session. Here are the settings that can be changed at the session level:

Please set these carefully before reading from and writing to the database. It would also be wise to store the character set name and collation in the same table you will be accessing.
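
The original list of settings did not survive on this page. As an assumption about what it covered, the session-level character set variables that MySQL documents are sketched below; SET NAMES is shorthand for the first three:

```sql
-- SET NAMES utf8mb4 is equivalent to setting these three session variables:
SET SESSION character_set_client     = utf8mb4;  -- encoding of statements the client sends
SET SESSION character_set_connection = utf8mb4;  -- encoding literals are converted to by the server
SET SESSION character_set_results    = utf8mb4;  -- encoding of result sets sent back

-- Collation used when comparing literals on this connection.
SET SESSION collation_connection = utf8mb4_unicode_ci;

-- Inspect the current values.
SHOW SESSION VARIABLES LIKE 'char%';
SHOW SESSION VARIABLES LIKE 'collation%';
```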

MariaDB – Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_turkish_ci,IMPLICIT) for operation ‘=’ and why

It would be useful if you could provide your table definitions. However, the error suggests that you are trying to check equality between two columns that have different collations.

One solution is to change collation of one of the columns to match the other collation. Check here - Column Level.
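
A sketch of that first approach, assuming two hypothetical tables t1 and t2 that each have a name column, with t2.name currently declared as VARCHAR(100) NOT NULL in utf8_turkish_ci:

```sql
-- Find the collation of each text column involved in the comparison.
SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('t1', 't2')
  AND COLLATION_NAME IS NOT NULL;

-- Change one column so both sides share utf8_unicode_ci.
-- MODIFY replaces the whole column definition, so repeat the type and nullability.
ALTER TABLE t2
    MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
```

Changing a column's collation rebuilds any index on it and can change which values count as equal, so unique keys on that column are worth checking first.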

Another solution is to use the COLLATE clause (I do not know if this works in MariaDB, but it should be there).
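
MariaDB does accept a COLLATE clause in expressions. A minimal sketch with the same hypothetical tables, where t1.name is utf8_unicode_ci and t2.name is utf8_turkish_ci:

```sql
-- The explicit COLLATE on one operand takes precedence over the implicit
-- column collations, so the comparison is done in utf8_unicode_ci.
SELECT t1.id, t2.id
FROM t1
JOIN t2
  ON t1.name = t2.name COLLATE utf8_unicode_ci;
```

The column wrapped in COLLATE may no longer be able to use its index for the join, which is one more reason to harmonize the columns when this comparison runs often.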

Personally, I prefer to avoid COLLATE and to harmonize string columns instead, as collation mismatches tend to force lots of COLLATE clauses into queries (a maintenance problem).