Sql-server – N’Șc’ considered duplicate key of N’C’ using Latin1_General_CI_AS collation

collationsql serversql server 2014t-sqlunique-constraint

I have a table with a unique key that includes an NVARCHAR(50) column (correct or not, but is there). So, when trying to insert Șc or C (doesn't matter the order of the insert) it breaks on the 2nd insert due to collation issues. Here is the error:

(1 row(s) affected) Msg 2601, Level 14, State 1, Line 16
Cannot insert duplicate key row in object 'dbo.testT' with unique index 'IX_TestT'. The duplicate key value is (C).

Select returns:

enter image description here

Database default collation is Latin1_General_CI_AS. Spent some time looking on how to solve it, without changing too much the already existing structure, but can't find a way to getting working. Tried different collations and combinations, everything fails. Read (here and here) about character expansions and so on, still stuck. Here is a sample code that I'm using to replicate the issue, feel free to modify and recommend anything that could help solve this.

CREATE TABLE testT (
    [Default_Collation]     [NVARCHAR] (50) COLLATE DATABASE_DEFAULT,
    [Latin1_General_CI_AS]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AS,
    [Latin1_General_CI_AI]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AI,
    [SQL_Collation]         [NVARCHAR] (50) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE UNIQUE CLUSTERED INDEX [IX_TestT] ON [dbo].[testT] ([Default_Collation])
ON [PRIMARY]
GO

INSERT INTO testT
SELECT  N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc'   --COLLATE Latin1_General_CI_AS

INSERT INTO testT
SELECT  N'C'    --COLLATE Latin1_General_CI_AS 
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE SQL_Latin1_General_CP1_CI_AS

SELECT * FROM testT;

DROP TABLE testT;

Best Answer

The problem is that the old SQL Server Collations (i.e. those with names starting with SQL_) and the first two versions of the Windows Collations (the 80 series which came with SQL Server 2000 and do not have a version number in the name, and the 90 series which came with SQL Server 2005) are missing the sort weights for a great many characters. This was mostly corrected starting with the 100 series Collations that came with SQL Server 2008.

As you can see in the below examples, the Ș character matches an empty string when using non-binary, version 80 or 90 Collations (and SQL Server Collations) since they both have the same sort weight: 0. Nothing. Nada. This means that when comparing N'Șc' with N'C' (using pre-series 100 Collations), you are really comparing N'c' with N'C' (test #1):

SELECT 1 WHERE N'Șc' = N'C' COLLATE Latin1_General_CS_AS;
-- no result (due to "c" and "C" being different case)

SELECT 2 WHERE N'Ș' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;
SELECT 3 WHERE N'Ș' = N'' COLLATE Latin1_General_CI_AS;

SELECT 4 WHERE N'Ș' = N'' COLLATE Latin1_General_BIN2;
-- no result (due to "Ș" still being a code point and empty string has no code points)

SELECT 5 WHERE N'Ș' = N'' COLLATE Latin1_General_100_CI_AS;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

SELECT 6 WHERE N'Ș' = N'' COLLATE Chinese_PRC_CI_AI;
SELECT 7 WHERE N'Ș' = N'' COLLATE Chinese_PRC_90_CI_AI;

SELECT 8 WHERE N'Ș' = N'' COLLATE Indic_General_90_CI_AI;
SELECT 9 WHERE N'Ș' = N'' COLLATE Indic_General_100_CI_AI;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

So, unfortunately you will need to drop the PK, alter the column to have a 100-level Collation (e.g. Latin1_General_100_CI_AS_SC), and then recreate the PK. Please note that the difference in that suggested Collation from the current Collation is both the 100 and the _SC at the end, which allows it to properly handle supplementary characters.

This does not mean that you need to:

  1. change the Collations of other tables (unless they have the same setup of NVARCHAR in the PK)
  2. change the Database's default Collation. The main issue with not changing the DB's Collation is that there will be a behavior difference between doing table.column = N'Ș' and @variable = N'Ș' since variables and string literals use the Database's default Collation.

For more examples of this behavior, please see the "Supplementary Characters" section of the following blog post of mine:

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)