Yes, you can set the default character set and collation at various levels.
First, as in your question, at the database level:
CREATE DATABASE IF NOT EXISTS foo
DEFAULT CHARACTER SET = 'utf8' DEFAULT COLLATE 'utf8_general_ci';
or
ALTER DATABASE foo
DEFAULT CHARACTER SET = 'utf8' DEFAULT COLLATE 'utf8_general_ci';
All new tables in that database will be created with the new character set and collation unless the table creation command specifies its own (a specification in the table command overrides the database default). Existing tables keep the character set they were created with.
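As a minimal sketch of that inheritance, the statements below create one table that takes the database default and one that overrides it, then compare the result. The table names `inherits_default` and `overrides_default` are hypothetical, and the sketch assumes the `foo` database from the statements above.

```sql
-- Hypothetical table: no character set given, so it takes the database default.
CREATE TABLE foo.inherits_default (
  name VARCHAR(50)
);

-- Hypothetical table: an explicit specification overrides the database default.
CREATE TABLE foo.overrides_default (
  name VARCHAR(50)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

-- Compare the collation each table ended up with.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'foo'
  AND TABLE_NAME IN ('inherits_default', 'overrides_default');
```

The first table should report the database default collation and the second should report latin1_swedish_ci.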
You can also set the default at the server level (so that all tables in all databases are created with it unless explicitly stated otherwise) by providing the --character-set-server and --collation-server options, either on the mysqld command line or, without the leading dashes, under [mysqld] in your my.cnf, and then restarting the server.
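To confirm which server-level defaults are in effect after the restart, something like the following can be run from any client. This is only a sketch for checking; it does not change any setting.

```sql
-- Server-wide defaults used when a database does not specify its own.
SHOW GLOBAL VARIABLES LIKE 'character_set_server';
SHOW GLOBAL VARIABLES LIKE 'collation_server';

-- Defaults recorded for each existing database.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA;
```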
And finally, verify the client connection settings. This doc explains how the client connection can fine-tune (and override) the default settings.
QUESTION #1
Why are there different levels of MySQL collation/charsets?
ANSWER TO QUESTION #1
There are two good reasons for having different character sets and collations:
Reason #1 : Disk Space
When you run this query
SELECT
maxlen,
GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
COUNT(1) CharSetCount
FROM information_schema.character_sets
GROUP BY maxlen\G
You get this:
mysql> SELECT
-> maxlen,
-> GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
-> COUNT(1) CharSetCount
-> FROM information_schema.character_sets
-> GROUP BY maxlen\G
*************************** 1. row ***************************
maxlen: 1
CharSets: cp1257,cp850,binary,koi8r,latin2,ascii,tis620,koi8u,greek,armscii8,keybcs2,macroman,latin7,cp1251,cp1256,dec8,hp8,geostd8,latin1,swe7,hebrew,cp1250,latin5,cp866,macce,cp852
CharSetCount: 26
*************************** 2. row ***************************
maxlen: 2
CharSets: big5,cp932,sjis,gbk,ucs2,euckr,gb2312
CharSetCount: 7
*************************** 3. row ***************************
maxlen: 3
CharSets: eucjpms,ujis,utf8
CharSetCount: 3
*************************** 4. row ***************************
maxlen: 4
CharSets: utf16,utf32,utf8mb4
CharSetCount: 3
4 rows in set (0.00 sec)
mysql>
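Some character sets need at most 1 byte to represent a character. Others need more. Given this information, you may want to refrain from using the eucjpms, ujis, utf8, utf16, utf32, and utf8mb4 character sets so that VARCHAR and TEXT data takes less space on disk. Keep in mind that maxlen is an upper bound: variable-width sets such as utf8 and utf8mb4 still store plain ASCII characters in 1 byte each, so the saving mainly matters for non-ASCII text.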
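A small sketch of the byte cost per character: the query below measures the same character (é, given as the UTF-8 bytes C3 A9 so the result does not depend on the connection character set) in several character sets. CHAR_LENGTH counts characters and LENGTH counts bytes.

```sql
SELECT
  CHAR_LENGTH(_utf8 X'C3A9')                  AS chars,               -- 1 character
  LENGTH(CONVERT(_utf8 X'C3A9' USING latin1)) AS latin1_bytes,        -- 1 byte
  LENGTH(_utf8 X'C3A9')                       AS utf8_bytes,          -- 2 bytes
  LENGTH(CONVERT(_utf8 X'C3A9' USING utf32))  AS utf32_bytes,         -- 4 bytes
  LENGTH(_utf8 'a')                           AS ascii_in_utf8_bytes; -- 1 byte
```

On recent MySQL versions the name utf8 is an alias for utf8mb3 and may raise a deprecation warning; the byte counts are the same.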
Reason #2 : Internationalization
Character sets each come with one or more collations to cover a variety of languages.
When you run this query
SELECT
A.CHARACTER_SET_NAME,
GROUP_CONCAT(COLLATION_NAME) Collations,
COUNT(1) CollationCount
FROM
information_schema.character_sets A
INNER JOIN information_schema.collations B
USING (CHARACTER_SET_NAME)
GROUP BY A.CHARACTER_SET_NAME\G
You will see that some character sets have multiple collations for different parts of Europe. Collations for Chinese, Japanese, Greek, and parts of Asia Minor and Scandinavia are also available.
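To make the effect of a collation concrete, here is a short sketch: the first statement lists the collations of one character set (latin1 is chosen only as an example), and the second compares the same two strings under a case-insensitive and a binary collation.

```sql
-- List every collation that belongs to one character set.
SHOW COLLATION WHERE Charset = 'latin1';

-- Same strings, different collations, different answers.
SELECT
  _utf8'a' = _utf8'A' COLLATE utf8_general_ci AS equal_case_insensitive, -- 1
  _utf8'a' = _utf8'A' COLLATE utf8_bin        AS equal_binary;           -- 0
```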
QUESTION #2
Should you always ensure your PHP connection matches the charset of the database you're working on?
ANSWER TO QUESTION #2
SCENARIO
You are driving at 3:00 AM. You are the only driver on the road. You come to an intersection. You have the red light.
Question : Do you stop or go through the red light?
Answer : Depends on the neighborhood
- Safe neighborhood?
  - Some abide by the law, stop at the red, and wait for green.
  - Some chance it and go through.
- Bad neighborhood or new to the area?
  - Some abide by the law, stop at the red, and wait for green AT THE RISK OF A CARJACKING.
  - Some chance it and go through to AVOID OR REDUCE RISK OF A CARJACKING.
  - Assume the worst and find another route.
How does this apply?
You should err on the side of caution. Always check the charset beforehand, because you do not know the neighborhood (client program, internet browser) the PHP connection will be entering, or whether there is a risk of a carjacking (putting invalid data into the database, requesting too much data for retrieval).
QUESTION #3
If you can have different tables that use different character sets do you just use SET NAMES or mysql(i)_set_charset to switch?
ANSWER TO QUESTION #3
By all means
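On the SQL side, a switch with SET NAMES might look like the sketch below. The character set and collation are only example values; the final SELECT shows which connection variables the statement changed.

```sql
-- Tell the server what the client sends and what it wants back.
SET NAMES 'utf8' COLLATE 'utf8_general_ci';

-- SET NAMES updates these connection variables together.
SELECT
  @@character_set_client,
  @@character_set_connection,
  @@character_set_results,
  @@collation_connection;
```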
QUESTION #4
If you have a table that has multiple charsets how do you manage that since the connection can only use one charset at a time?
ANSWER TO QUESTION #4
You may have to shift character sets with the DB Session. Here are the settings that can be changed at the session level:
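As a sketch, the session-scoped character set and collation variables can be listed and then changed one at a time like this. The latin1 values are illustrative assumptions, not a recommendation.

```sql
-- Inspect the current session's settings.
SHOW SESSION VARIABLES LIKE 'character\_set\_%';
SHOW SESSION VARIABLES LIKE 'collation\_%';

-- Change the connection-related ones for this session only.
SET SESSION character_set_client     = latin1;
SET SESSION character_set_connection = latin1;
SET SESSION character_set_results    = latin1;
SET SESSION collation_connection     = latin1_swedish_ci;
```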
Please set these carefully before reading from and writing to the database. It would also be wise to store the character set name and collation in the same table you will be accessing.
Best Answer
There are a few options for managing character set conversions. As you have found, there is the alter table option, and you can use the character set clause or the collate clause or both in your statements.
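A rough sketch of those options follows. The table `customers` and column `last_name` are hypothetical names, and utf8mb4 is just an example target character set.

```sql
-- Convert a whole table, including its character columns.
ALTER TABLE customers
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Change a single column; MODIFY restates the full column definition,
-- so repeat any NULL/DEFAULT attributes the column already has.
ALTER TABLE customers
  MODIFY last_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Override the collation for one statement only.
SELECT last_name FROM customers ORDER BY last_name COLLATE utf8mb4_bin;

-- Convert the character set of a value for one statement only.
SELECT CONVERT(last_name USING latin1) FROM customers;
```

Converting a large table this way rebuilds it, so try it on a copy first.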
Other options include changing the character set and collation for the whole database (which I can see you don't want to do just yet).
Or there's a tool in the free and open source Percona Toolkit that is very popular with developers, pt-online-schema-change, that helps you manage such migrations for tables with a primary key.
David Berube, a consultant independent from Percona, wrote a detailed blog post on our community blog about some of the problems that you can encounter when changing character sets and collations. You can read that at https://www.percona.com/community-blog/2018/06/12/character-sets-migrating-utf8mb4-pt_online_schema_change/
The potential pitfalls include database version compatibility, application "expectations", key length management (index key lengths can change), and false positives: a column that says it's latin1 may not be!
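One way to spot such a false positive is to look at the stored bytes and reinterpret them as UTF-8. This is only a sketch: `legacy_table` and `notes` are hypothetical names for a table and a column declared as latin1.

```sql
SELECT
  notes,                                                     -- as the declared charset shows it
  HEX(notes)                                   AS raw_bytes, -- the bytes actually stored
  CONVERT(CAST(notes AS BINARY) USING utf8mb4) AS reinterpreted_as_utf8mb4
FROM legacy_table
LIMIT 20;
```

If the reinterpreted value reads correctly while the original shows sequences such as Ã©, the column probably holds UTF-8 bytes under a latin1 label.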
On the Percona website itself, there are also a number of blog posts and a free webinar, Troubleshooting Issues with MySQL Character Sets: https://www.percona.com/resources/webinars/troubleshooting-issues-mysql-character-sets
Hope some of these will help you work out the best way to approach your scenarios.
-- Disclosure: I work for Percona.