MySQL Case Sensitive Join on Varchar Columns

case sensitivecollationMySQLmysql-5.7

I have two tables, both with identical data types, character sets, and collation explicitly specified.

CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL

For a case-sensitive join on the CLERK field, do I need to specify the collation in the join clause as well, or does the fact that it is specified at the DDL level mean that it isn't needed in the join?

FROM
    CUSTOMER S JOIN 
    CLERK C ON S.CLERK = C.CLERK COLLATE latin1_bin

Best Answer

For a case-sensitive join on the CLERK field, do I need to specify the collation in the join clause as well, or does the fact that it is specified at the DDL level mean that it isn't needed in the join?

Simple answer: No need to specify the collation in the query. DDL level collation (especially when it's the same on both sides and both sides are columns) will be used (which is why we specify it at the DDL level in the first place).


Detailed answer: there is a hierarchy of precedence for which collation to use in a given operation (concatenation, predicate, etc). The collation used not only depends on whether it's a column vs a literal, etc, but also whether it's Unicode vs non-Unicode, and even binary vs non-binary. The full description can be found here, Collation Coercibility in Expressions, which is the MySQL 5.7 documentation since you are using that version. One rather interesting rule is:

For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, except that it is for collations rather than data types.

All of that being said, I need to point out that the statement "binary collations are case-sensitive", while extremely common, is definitely incorrect.

  1. On a very basic level, sorting is different. Case-sensitive collations will sort "a" with "A" (though which one comes first can depend on the culture), "b" with "B", and so on. Binary collations will sort according to the underlying value / Code Point of each character, which becomes very apparent when uppercase and lowercase versions of a character separated by others according to their value.

  2. Being case "sensitive" means that you can also be "INsensitive" to other properties of characters, the main one being accents. That is also the only other property when it comes to non-Unicode character sets, but Unicode allows for Kana type sensitivity, width sensitivity, and SQL Server (as of version 2017) even allows for variation selector sensitivity. Binary collations do not allow for a character to equal anything other than itself, even if other forms of it exist. Again, this doesn't really happen so much in non-Unicode character sets, but in Unicode there can be several versions of a character, including wide, superscript, subscript, italics, upside down (referred to as "turned"), etc. MySQL, starting in version 8.0 (as far as I can tell), is adding more variations to the sensitivity options (at least for the ut8mb4 character set and collations):

To illustrate both of these points, I have set up a demo on the awesome db<>fiddle. I had to use MySQL 8.0 not only to get the _ai_ci collation, but also because the COLLATE latin1_general_ci clause (2nd to last query, #5) had no affect (for some odd reason; documentation states that when collation name has only _ci, then _ai is implied, yet for both MySQL 5.6 and 5.7 on db<>fiddle, the behavior is still _as_cs or _bin).

AND, there are even other ways in which binary collations are not "case-sensitive". They cannot account for:

  • combining diacritics
  • expansions
  • contractions

I did not list or provide examples for these earlier because they do not pertain to 8-bit encodings, and latin1 is an 8-bit encoding. These are features of Unicode, and so should apply to any Unicode collation (though I have not tested them on MySQL, but they are implemented correctly in SQL Server).

P.S. I have a detailed explanation of all of this (including the Unicode-specific behaviors noted directly above) in the following post: No, Binary Collations are not Case-Sensitive. That is currently framed only in the context of SQL Server, but I can work in the example that I came up with for this answer when I have time. What's important is that the concept is the same across RDBMSs.


For reference (just in case db<>fiddle is inaccessible), the queries are:

Query 1

CREATE TABLE CLERK (
  CLERK_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL
);

CREATE TABLE CUSTOMER (
  CUSTOMER_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  NAME VARCHAR(10) NOT NULL,
  CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL,
  CLERK_CI_AI CHAR(3) CHARACTER SET latin1 COLLATE latin1_general_ci NULL
);

INSERT INTO CUSTOMER (NAME, CLERK, CLERK_CI_AI) VALUES('John Doe', 'Ü', 'Ü');

INSERT INTO CLERK (CLERK) VALUES('Ü');
INSERT INTO CLERK (CLERK) VALUES('ü');
INSERT INTO CLERK (CLERK) VALUES('U');
INSERT INTO CLERK (CLERK) VALUES('u');
INSERT INTO CLERK (CLERK) VALUES('Ù');
INSERT INTO CLERK (CLERK) VALUES('ù');
INSERT INTO CLERK (CLERK) VALUES('Û');
INSERT INTO CLERK (CLERK) VALUES('û');

INSERT INTO CLERK (CLERK) VALUES('Y');
INSERT INTO CLERK (CLERK) VALUES('y');
INSERT INTO CLERK (CLERK) VALUES('Ý');
INSERT INTO CLERK (CLERK) VALUES('ý');

Query 2

SELECT 0 AS "ver", C.*
FROM   CLERK C;

Query 3

SELECT "bin sort" AS "ver", C.*
FROM   CLERK C
ORDER BY C.CLERK;

Query 4

SELECT "cs sort" AS "ver", C.*
FROM   CLERK C
ORDER BY C.CLERK COLLATE latin1_general_cs;

Query 5

SELECT 1 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK = C.CLERK;

Query 6

SELECT 2 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK = C.CLERK COLLATE latin1_bin;

Query 7

SELECT 3 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK;

Query 8

SELECT 4 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK COLLATE latin1_bin;

Query 9

# is accent-sensitive even though documentation states it should be
# accent-INsensitive
# 
SELECT 5 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK COLLATE latin1_general_ci;

Query 10

SELECT 6 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON CONVERT(S.CLERK_CI_AI using utf8mb4)
       = CONVERT(C.CLERK USING utf8mb4) COLLATE utf8mb4_0900_ai_ci;