MySQL – Different characters, same ASCII code

encoding, MySQL, varchar

I have this query that returns two results:

SELECT id FROM table1 WHERE id like 'nm041033%'

nm0410331
nm0410331

And this slightly different query that returns only one result:

SELECT id FROM table1 WHERE id='nm0410331'

nm0410331

I checked the ASCII code of the ninth (apparently last) character and got the same value for both rows:

SELECT id,ascii(substr(id,9,1)) FROM table1 WHERE id like 'nm041033%'

nm0410331 49
nm0410331 49

I guess it is a rare encoding problem. How can I solve it?

PS: The field id is a primary key. The collation is latin1_general_ci (character set latin1), and the values were inserted using PHP utf8_decode().

UPDATE: I changed the collation to ascii_general_ci, and now this query gives me zero results:

SELECT id FROM table1 WHERE id='nm0410331'

However, those two ids are still not treated as the same value. If I use SELECT DISTINCT or GROUP BY, I get two rows.

PS: The last character isn't the number you can type with the keyboard.

Best Answer

Thanks to the insight of Akina, who suggested using HEX() to inspect the field, I found an extra '0A' byte (a line feed) at the end of one of the values.

After removing the primary key constraint (so that the temporarily duplicated id would not violate it), I ran:

UPDATE table1 SET id = TRIM(TRAILING UNHEX('0A') FROM id);

That solved it.

PS: For future googlers: running SELECT id FROM table1 WHERE id LIKE 'nm0410331%' would also have revealed my silly problem, because it matches both rows even though only one of them is exactly 'nm0410331'.
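The HEX() check described in the answer can be sketched as follows, reusing the table1 and id names from the question. ASCII(SUBSTR(id,9,1)) returned 49 for both rows because the ninth character really is '1' in both; the stray byte sits in a tenth position that the question's query never looked at. The second query is an assumption about how one might list every affected row, not something the original answer ran.

```sql
-- Show each matching id with its raw bytes and its length in bytes.
-- A clean 'nm0410331' is 9 bytes (6E6D30343130333331);
-- a trailing line feed adds a tenth byte, so the hex string ends in 0A.
SELECT id,
       HEX(id)    AS id_hex,
       LENGTH(id) AS id_bytes
FROM table1
WHERE id LIKE 'nm041033%';

-- List every row whose id ends in a line feed (0x0A).
-- HEX() emits two digits per byte, so for a single-byte charset such as
-- latin1 a trailing '0A' in the hex string is always the last byte.
SELECT id, HEX(id) AS id_hex
FROM table1
WHERE HEX(id) LIKE '%0A';
```

Running the second query before the UPDATE shows how many rows will change, which helps predict whether trimming will create duplicate keys.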

QUESTION #1

Why are there different levels of MySQL collation/charsets?

ANSWER TO QUESTION #1

There are two good reasons for having different character sets and collations:

Reason #1 : Disk Space

When you run this query

SELECT
    maxlen,
    GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    COUNT(1) CharSetCount
FROM information_schema.character_sets
GROUP BY maxlen\G

You get this:

mysql> SELECT
    ->     maxlen,
    ->     GROUP_CONCAT(CHARACTER_SET_NAME) CharSets,
    ->     COUNT(1) CharSetCount
    -> FROM information_schema.character_sets
    -> GROUP BY maxlen\G
*************************** 1. row ***************************
      maxlen: 1
    CharSets: cp1257,cp850,binary,koi8r,latin2,ascii,tis620,koi8u,greek,armscii8,keybcs2,macroman,latin7,cp1251,cp1256,dec8,hp8,geostd8,latin1,swe7,hebrew,cp1250,latin5,cp866,macce,cp852
CharSetCount: 26
*************************** 2. row ***************************
      maxlen: 2
    CharSets: big5,cp932,sjis,gbk,ucs2,euckr,gb2312
CharSetCount: 7
*************************** 3. row ***************************
      maxlen: 3
    CharSets: eucjpms,ujis,utf8
CharSetCount: 3
*************************** 4. row ***************************
      maxlen: 4
    CharSets: utf16,utf32,utf8mb4
CharSetCount: 3
4 rows in set (0.00 sec)

mysql>

Some character sets need at most 1 byte to represent a character. Others need more. Given this information, you may want to refrain from using the eucjpms, ujis, utf8, utf16, utf32, and utf8mb4 character sets so that VARCHAR and TEXT data take less space on disk.
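To see the per-character cost directly, the sketch below measures the same character, é, in three character sets. It assumes a server that has utf8mb4 and utf32 (both appear in the listing above) and uses a hex literal with a character set introducer so the result does not depend on the connection charset. Note that maxlen is an upper bound: in the variable-length sets utf8 and utf8mb4, plain ASCII characters still occupy one byte each.

```sql
-- X'C3A9' is the UTF-8 encoding of 'é'; the _utf8mb4 introducer tells
-- MySQL how to interpret those bytes.
SELECT
    LENGTH(CONVERT(_utf8mb4 X'C3A9' USING latin1)) AS bytes_latin1,   -- 1
    LENGTH(_utf8mb4 X'C3A9')                       AS bytes_utf8mb4,  -- 2
    LENGTH(CONVERT(_utf8mb4 X'C3A9' USING utf32))  AS bytes_utf32,    -- 4
    CHAR_LENGTH(_utf8mb4 X'C3A9')                  AS char_count,     -- 1
    LENGTH(_utf8mb4 'a')                           AS bytes_ascii_a;  -- 1
```

LENGTH() counts bytes while CHAR_LENGTH() counts characters, so comparing the two on a real column gives a rough idea of how much multi-byte data it holds.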

Reason #2 : Internationalization

Each character set comes with one or more collations to cover a variety of languages.

When you run this query

SELECT
    A.CHARACTER_SET_NAME,
    GROUP_CONCAT(COLLATION_NAME) Collations,
    COUNT(1) CollationCount
FROM
    information_schema.character_sets A
    INNER JOIN information_schema.collations B
    USING (CHARACTER_SET_NAME)
GROUP BY A.CHARACTER_SET_NAME\G

You will see that some character sets come with multiple collations for different parts of Europe. Collations for Chinese, Japanese, Greek, and parts of Asia Minor and Scandinavia are also available.
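As a small illustration of what a collation changes, the sketch below lists the collations of one character set and then compares the same two strings under a case-insensitive and a case-sensitive collation. latin1 is chosen only because it is the character set from the question at the top of the page; the introducers pin the literals to latin1 so the connection charset does not matter.

```sql
-- All collations that ship with the latin1 character set.
SHOW COLLATION WHERE Charset = 'latin1';

-- Same strings, different collations, different answers.
SELECT
    _latin1'a' = _latin1'A' COLLATE latin1_general_ci AS equal_ci,  -- 1
    _latin1'a' = _latin1'A' COLLATE latin1_general_cs AS equal_cs;  -- 0
```

The character set decides which bytes are stored; the collation decides how those bytes compare and sort.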

QUESTION #2

Should you always ensure your PHP connection matches the charset of the database you're working on?

ANSWER TO QUESTION #2

SCENARIO

You are driving at 3:00 AM. You are the only driver on the road. You come to an intersection. You have the red light.

Question : Do you stop or go through the red light?

Answer : Depends on the neighborhood

Safe neighborhood?
- Some abide by the law, stop at the red, and wait for green.
- Some chance it and go through.

Bad neighborhood or new to the area?
- Some abide by the law, stop at the red, and wait for green AT THE RISK OF A CARJACKING.
- Some chance it and go through to AVOID OR REDUCE RISK OF A CARJACKING.
- Some assume the worst and find another route.

How does this apply?

You should err on the side of caution. You should always check the charset beforehand because you do not know the neighborhood (client program, internet browser) the PHP connection will be entering and if there is a risk of a carjacking (putting invalid data into the database, requesting too much data for retrieval).
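One way to check the charset beforehand is to ask the server what the current session and the target tables are using. This is a minimal sketch, assuming the connection already has a default database selected (DATABASE() returns NULL otherwise).

```sql
-- What the current connection sends, receives, and converts to.
SHOW SESSION VARIABLES LIKE 'character_set%';
SHOW SESSION VARIABLES LIKE 'collation%';

-- Default collation of each table in the current database;
-- the character set is the collation name's prefix (e.g. latin1_general_ci).
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE();
```

If the connection values differ from the table's, the server converts data in both directions, and characters missing from the target set are typically replaced with '?'.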

QUESTION #3

If you can have different tables that use different character sets do you just use SET NAMES or mysql(i)_set_charset to switch?

ANSWER TO QUESTION #3

By all means
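A minimal sketch of switching with SET NAMES between two tables follows. The table and column names (legacy_latin1_table, modern_utf8_table, title) are hypothetical and stand in for whatever tables use the respective character sets.

```sql
-- Talk to the server in latin1 while working with the latin1 table.
SET NAMES latin1;
SELECT id, title FROM legacy_latin1_table WHERE id = 1;

-- Switch the connection to utf8mb4 before touching the other table.
SET NAMES utf8mb4;
SELECT id, title FROM modern_utf8_table WHERE id = 1;
```

SET NAMES describes the encoding the client sends and expects back; it does not change how any table stores its data, so the server converts between the column charset and the connection charset as needed.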

QUESTION #4

If you have a table that has multiple charsets how do you manage that since the connection can only use one charset at a time?

ANSWER TO QUESTION #4

You may have to switch character sets within the DB session. Here are the settings that can be changed at the session level:

Please set these carefully before reading from and writing to the database. It would also be wise to store the character set name and collation in the same table you will be accessing.
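The session-level settings are not spelled out here, so the following is an assumption about which ones are meant: the connection character set variables that MySQL documents, three of which SET NAMES assigns in one step. The sketch sets them individually and then looks up the character set of each column of a table; table1 is borrowed from the question at the top of the page and utf8mb4 is only an example value.

```sql
-- Individual session settings (SET NAMES assigns the first three together).
SET SESSION character_set_client     = utf8mb4;  -- encoding of statements sent by the client
SET SESSION character_set_connection = utf8mb4;  -- encoding the server converts literals to
SET SESSION character_set_results    = utf8mb4;  -- encoding of result sets sent back
SET SESSION collation_connection     = utf8mb4_general_ci;

-- Character set and collation of every text column in one table.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'table1'
  AND CHARACTER_SET_NAME IS NOT NULL;
```

Because the server converts each column to character_set_results on the way out, a connection charset that covers every column's characters, such as utf8mb4, often avoids per-query switching.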