Sql-server – N’Șc’ considered duplicate key of N’C’ using Latin1_General_CI_AS collation

collationsql serversql server 2014t-sqlunique-constraint

I have a table with a unique key that includes an NVARCHAR(50) column (correct or not, but is there). So, when trying to insert Șc or C (doesn't matter the order of the insert) it breaks on the 2nd insert due to collation issues. Here is the error:

(1 row(s) affected) Msg 2601, Level 14, State 1, Line 16
Cannot insert duplicate key row in object 'dbo.testT' with unique index 'IX_TestT'. The duplicate key value is (C).

Select returns:

Database default collation is Latin1_General_CI_AS. Spent some time looking on how to solve it, without changing too much the already existing structure, but can't find a way to getting working. Tried different collations and combinations, everything fails. Read (here and here) about character expansions and so on, still stuck. Here is a sample code that I'm using to replicate the issue, feel free to modify and recommend anything that could help solve this.

CREATE TABLE testT (
    [Default_Collation]     [NVARCHAR] (50) COLLATE DATABASE_DEFAULT,
    [Latin1_General_CI_AS]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AS,
    [Latin1_General_CI_AI]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AI,
    [SQL_Collation]         [NVARCHAR] (50) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE UNIQUE CLUSTERED INDEX [IX_TestT] ON [dbo].[testT] ([Default_Collation])
ON [PRIMARY]
GO

INSERT INTO testT
SELECT  N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc'   --COLLATE Latin1_General_CI_AS

INSERT INTO testT
SELECT  N'C'    --COLLATE Latin1_General_CI_AS 
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE SQL_Latin1_General_CP1_CI_AS

SELECT * FROM testT;

DROP TABLE testT;

Best Answer

The problem is that the old SQL Server Collations (i.e. those with names starting with SQL_) and the first two versions of the Windows Collations (the 80 series which came with SQL Server 2000 and do not have a version number in the name, and the 90 series which came with SQL Server 2005) are missing the sort weights for a great many characters. This was mostly corrected starting with the 100 series Collations that came with SQL Server 2008.

As you can see in the below examples, the Ș character matches an empty string when using non-binary, version 80 or 90 Collations (and SQL Server Collations) since they both have the same sort weight: 0. Nothing. Nada. This means that when comparing N'Șc' with N'C' (using pre-series 100 Collations), you are really comparing N'c' with N'C' (test #1):

SELECT 1 WHERE N'Șc' = N'C' COLLATE Latin1_General_CS_AS;
-- no result (due to "c" and "C" being different case)

SELECT 2 WHERE N'Ș' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;
SELECT 3 WHERE N'Ș' = N'' COLLATE Latin1_General_CI_AS;

SELECT 4 WHERE N'Ș' = N'' COLLATE Latin1_General_BIN2;
-- no result (due to "Ș" still being a code point and empty string has no code points)

SELECT 5 WHERE N'Ș' = N'' COLLATE Latin1_General_100_CI_AS;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

SELECT 6 WHERE N'Ș' = N'' COLLATE Chinese_PRC_CI_AI;
SELECT 7 WHERE N'Ș' = N'' COLLATE Chinese_PRC_90_CI_AI;

SELECT 8 WHERE N'Ș' = N'' COLLATE Indic_General_90_CI_AI;
SELECT 9 WHERE N'Ș' = N'' COLLATE Indic_General_100_CI_AI;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

So, unfortunately you will need to drop the PK, alter the column to have a 100-level Collation (e.g. Latin1_General_100_CI_AS_SC), and then recreate the PK. Please note that the difference in that suggested Collation from the current Collation is both the 100 and the _SC at the end, which allows it to properly handle supplementary characters.

This does not mean that you need to:

change the Collations of other tables (unless they have the same setup of NVARCHAR in the PK)
change the Database's default Collation. The main issue with not changing the DB's Collation is that there will be a behavior difference between doing table.column = N'Ș' and @variable = N'Ș' since variables and string literals use the Database's default Collation.

For more examples of this behavior, please see the "Supplementary Characters" section of the following blog post of mine:

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)

Related Solutions

Sql-server – Unique index corrupted SQL. Select query returns single row but create unique index fails

if you consider error, Msg 1505, Level 16, State 1, Line 2 The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.MSmerge_contents' and the index name 'uc1SycContents'. The duplicate key value is (7696031, 08703987-557d-e111-9888-e61f13c44f03)...... I am running below Query
select * from msmerge_contents where rowguid='08703987-557d-e111-9888-e61f13c44f03'
and it is returning only 1 row

When you have index corruption problems (ie. keys present in NC index but not in base table or vice-versa) you must be very careful about the SQL you use to validate data. At this moment your data is inconsistent but the query optimizer does not know that and completely trusts your schema, including these incorrect indexes. As such it may optimize your query to use one of the NC indexes that is missing a key and the result will also miss a a key falsely returning no duplicates. To solve this catch-22 situation you need to force the optimizer hand by explicitly requesting an index or another and make sure the projected list of columns can be satisfied by the index you enforced (ie. no *). Assuming uc1SycContents is not the clustered index, try out the following:

select rowguid
from msmerge_contents with INDEX (1)
where rowguid='08703987-557d-e111-9888-e61f13c44f03';

select rowguid
from msmerge_contents with INDEX ([uc1SycContents])
where rowguid='08703987-557d-e111-9888-e61f13c44f03';

This will forcefully check if the rowguid has a duplicate for that guid in the base table clustered index (index id 1) vs. the index uc1SycContents. I expect that the first query returns 2 (or more) rows while the second returns 1.

Sql-server – Using collate: When to use and when not to use

If you are going to use custom collations for specific databases then yes, you'll need to make the collations match whenever you are joining or unioning data from the two databases.

In fact you will need to do this with many metadata queries anyway. Just look at catalog views like sys.tables:

SELECT c.name, c.collation_name
FROM sys.all_columns AS c
INNER JOIN sys.all_views AS v
ON c.[object_id] = v.[object_id]
INNER JOIN sys.schemas AS s
ON v.[schema_id] = s.[schema_id]
WHERE s.name = N'sys' AND v.name = N'tables'
AND c.collation_name IS NOT NULL;

Results:

name                    SQL_Latin1_General_CP1_CI_AS
type                    Latin1_General_CI_AS_KS_WS
type_desc               Latin1_General_CI_AS_KS_WS
lock_escalation_desc    Latin1_General_CI_AS_KS_WS
durability_desc         Latin1_General_CI_AS_KS_WS

So here we have columns with two different collations inside a single system object. No database can have a DATABASE_DEFAULT that matches both...

You don't have any control over the collation of your customer's columns or databases, or the server collation. So really the only ways to resolve the conflict are to:

use a method that doesn't hard-code a specific collation (like DATABASE_DEFAULT)
hard-code some specific, compatible collation on both sides

Since the latter is more work, makes for more complex queries, and introduces more opportunities to turn seeks into scans, I think the former is really your best option.

Best Answer

Related Solutions

Sql-server – Unique index corrupted SQL. Select query returns single row but create unique index fails

Sql-server – Using collate: When to use and when not to use

Related Question