MySQL Case Sensitive Join on Varchar Columns

case sensitivecollationMySQLmysql-5.7

I have two tables, both with identical data types, character sets, and collation explicitly specified.

CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL

For a case-sensitive join on the CLERK field, do I need to specify the collation in the join clause as well, or does the fact that it is specified at the DDL level mean that it isn't needed in the join?

FROM
    CUSTOMER S JOIN 
    CLERK C ON S.CLERK = C.CLERK COLLATE latin1_bin

Best Answer

For a case-sensitive join on the CLERK field, do I need to specify the collation in the join clause as well, or does the fact that it is specified at the DDL level mean that it isn't needed in the join?

Simple answer: No need to specify the collation in the query. DDL level collation (especially when it's the same on both sides and both sides are columns) will be used (which is why we specify it at the DDL level in the first place).

Detailed answer: there is a hierarchy of precedence for which collation to use in a given operation (concatenation, predicate, etc). The collation used not only depends on whether it's a column vs a literal, etc, but also whether it's Unicode vs non-Unicode, and even binary vs non-binary. The full description can be found here, Collation Coercibility in Expressions, which is the MySQL 5.7 documentation since you are using that version. One rather interesting rule is:

For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, except that it is for collations rather than data types.

All of that being said, I need to point out that the statement "binary collations are case-sensitive", while extremely common, is definitely incorrect.

On a very basic level, sorting is different. Case-sensitive collations will sort "a" with "A" (though which one comes first can depend on the culture), "b" with "B", and so on. Binary collations will sort according to the underlying value / Code Point of each character, which becomes very apparent when uppercase and lowercase versions of a character separated by others according to their value.
Being case "sensitive" means that you can also be "INsensitive" to other properties of characters, the main one being accents. That is also the only other property when it comes to non-Unicode character sets, but Unicode allows for Kana type sensitivity, width sensitivity, and SQL Server (as of version 2017) even allows for variation selector sensitivity. Binary collations do not allow for a character to equal anything other than itself, even if other forms of it exist. Again, this doesn't really happen so much in non-Unicode character sets, but in Unicode there can be several versions of a character, including wide, superscript, subscript, italics, upside down (referred to as "turned"), etc. MySQL, starting in version 8.0 (as far as I can tell), is adding more variations to the sensitivity options (at least for the ut8mb4 character set and collations):
- New collations in MySQL 8.0.0 (accent and case INsensitive)
- MySQL 8.0.1: Accent and case sensitive collations for utf8mb4 (accent and case sensitive)
- The MySQL 8.0.2 Milestone Release is available (accent sensitive, case INsensitive)

To illustrate both of these points, I have set up a demo on the awesome db<>fiddle. I had to use MySQL 8.0 not only to get the _ai_ci collation, but also because the COLLATE latin1_general_ci clause (2nd to last query, #5) had no affect (for some odd reason; documentation states that when collation name has only _ci, then _ai is implied, yet for both MySQL 5.6 and 5.7 on db<>fiddle, the behavior is still _as_cs or _bin).

AND, there are even other ways in which binary collations are not "case-sensitive". They cannot account for:

combining diacritics
expansions
contractions

I did not list or provide examples for these earlier because they do not pertain to 8-bit encodings, and latin1 is an 8-bit encoding. These are features of Unicode, and so should apply to any Unicode collation (though I have not tested them on MySQL, but they are implemented correctly in SQL Server).

P.S. I have a detailed explanation of all of this (including the Unicode-specific behaviors noted directly above) in the following post: No, Binary Collations are not Case-Sensitive. That is currently framed only in the context of SQL Server, but I can work in the example that I came up with for this answer when I have time. What's important is that the concept is the same across RDBMSs.

For reference (just in case db<>fiddle is inaccessible), the queries are:

Query 1

CREATE TABLE CLERK (
  CLERK_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL
);

CREATE TABLE CUSTOMER (
  CUSTOMER_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  NAME VARCHAR(10) NOT NULL,
  CLERK CHAR(3) CHARACTER SET latin1 COLLATE latin1_bin NULL,
  CLERK_CI_AI CHAR(3) CHARACTER SET latin1 COLLATE latin1_general_ci NULL
);

INSERT INTO CUSTOMER (NAME, CLERK, CLERK_CI_AI) VALUES('John Doe', 'Ü', 'Ü');

INSERT INTO CLERK (CLERK) VALUES('Ü');
INSERT INTO CLERK (CLERK) VALUES('ü');
INSERT INTO CLERK (CLERK) VALUES('U');
INSERT INTO CLERK (CLERK) VALUES('u');
INSERT INTO CLERK (CLERK) VALUES('Ù');
INSERT INTO CLERK (CLERK) VALUES('ù');
INSERT INTO CLERK (CLERK) VALUES('Û');
INSERT INTO CLERK (CLERK) VALUES('û');

INSERT INTO CLERK (CLERK) VALUES('Y');
INSERT INTO CLERK (CLERK) VALUES('y');
INSERT INTO CLERK (CLERK) VALUES('Ý');
INSERT INTO CLERK (CLERK) VALUES('ý');

Query 2

SELECT 0 AS "ver", C.*
FROM   CLERK C;

Query 3

SELECT "bin sort" AS "ver", C.*
FROM   CLERK C
ORDER BY C.CLERK;

Query 4

SELECT "cs sort" AS "ver", C.*
FROM   CLERK C
ORDER BY C.CLERK COLLATE latin1_general_cs;

Query 5

SELECT 1 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK = C.CLERK;

Query 6

SELECT 2 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK = C.CLERK COLLATE latin1_bin;

Query 7

SELECT 3 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK;

Query 8

SELECT 4 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK COLLATE latin1_bin;

Query 9

# is accent-sensitive even though documentation states it should be
# accent-INsensitive
# 
SELECT 5 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON S.CLERK_CI_AI = C.CLERK COLLATE latin1_general_ci;

Query 10

SELECT 6 AS "ver", S.*, C.*
FROM CUSTOMER S
JOIN CLERK C
  ON CONVERT(S.CLERK_CI_AI using utf8mb4)
       = CONVERT(C.CLERK USING utf8mb4) COLLATE utf8mb4_0900_ai_ci;

Related Solutions

Postgresql – Can PostgreSQL create a case sensitive database

I ended up writing some code that transforms the SQL generated by the application into PostgreSQL-compatible SQL. It's pretty straightforward:

Split the statement into sensible tokens, skipping single-quoted string literals
Double-quote anything that is not a keyword or number

I also took advantage of this layer to transform calls to isnull to coalesce. So far it's working pretty well.

Mysql – How to do a case sensitive search in WHERE clause

by default, MySQL does not consider the case of the strings

This is not quite true. Whenever you create database in MySQL, the database/schema has a character set and a collation. Each character set has a default collation; see here for more information.

The default collation for character set latin1, which is latin1_swedish_ci, happens to be case-insensitive.

You can choose a case-sensitive collation, for example latin1_general_cs (MySQL grammar):

CREATE SCHEMA IF NOT EXISTS `myschema` 
DEFAULT CHARACTER SET latin1 
COLLATE latin1_general_cs ;

This has an effect on things like grouping and equality. For example,

create table casetable (
  id int primary key, 
  thing varchar(50)
);

select * from casetable;
+----+-------+
| id | thing |
+----+-------+
|  3 | abc   |
|  4 | ABC   |
|  5 | aBc   |
|  6 | abC   |
+----+-------+

In a case-sensitive database, we get:

select thing, count(*) from casetable group by thing;
+-------+----------+
| thing | count(*) |
+-------+----------+
| ABC   |        1 |
| aBc   |        1 |
| abC   |        1 |
| abc   |        1 |
+-------+----------+

select * from casetable where thing = "abc";
+----+-------+
| id | thing |
+----+-------+
|  3 | abc   |
+----+-------+

While in a case-insensitive database, we get:

select thing, count(*) from casetable group by thing;
+-------+----------+
| thing | count(*) |
+-------+----------+
| abc   |        4 |
+-------+----------+

select * from casetable where thing = "abc";
+----+-------+
| id | thing |
+----+-------+
|  3 | abc   |
|  4 | ABC   |
|  5 | aBc   |
|  6 | abC   |
+----+-------+

Note that you can also change the collation from within a query. For example, in the case-sensitive database, I can do

select * from casetable where thing collate latin1_swedish_ci = "abc";
+----+-------+
| id | thing |
+----+-------+
|  3 | abc   |
|  4 | ABC   |
|  5 | aBc   |
|  6 | abC   |
+----+-------+

Best Answer

Related Solutions

Postgresql – Can PostgreSQL create a case sensitive database

Mysql – How to do a case sensitive search in WHERE clause

Related Question