SQL Server – Handling Problematic Slashes and Question Marks in Unique Index Values

sql serversql-server-2016unicode

What is it about the Phone values inserted below that SQL Server is treating them as identical in a unique index?

CREATE TABLE Phone
(
  Id int identity(1, 1) primary key,
  Phone nvarchar(448) not null
)
go

create unique index IX_Phone on Phone(Phone)
with (data_compression = page);
go

insert into Phone Values ('?281/?263-?8400');
insert into Phone Values ('‎281/‎263-‎8400');

select * from Phone;

drop table Phone;

I receive an error message:

Msg 2601, Level 14, State 1, Line 13
Cannot insert duplicate key row in object 'dbo.Phone' with unique index 'IX_Phone'. The duplicate key value is (?281/?263-?8400).

Best Answer

Your problem is that you passes unicode strings as non-unicode.

Your '‎281/‎263-‎8400' is a string of 15 characters, not 12, there are 3 non printable 8206 symbols, Left-to-right mark

select len('‎281/‎263-‎8400'); -- 15 !!!

Try this code where I pass the unicode values as unicode and see that there is no problem at all:

CREATE TABLE dbo.Phone
(
  Id int identity(1, 1) primary key,
  Phone nvarchar(448) not null
)
go

create unique index IX_Phone on Phone(Phone)
with (data_compression = page);
go

insert into Phone(phone) Values (N'?281/?263-?8400');
insert into Phone(phone) Values (N'‎281/‎263-‎8400');

select * from Phone;

And this is what your string '‎281/‎263-‎8400' really contains (dbo.nums is my table that contains natural numbers):

declare @t table(col1 nvarchar(100), col2 nvarchar(100));
insert into @t values (N'?281/?263-?8400',  N'‎281/‎263-‎8400'); 

select n, unicode(substring(col1, n, 1)), unicode(substring(col2, n, 1))
from @t cross join dbo.nums
where n <= 15;

enter image description here

Now what happens when you pass your unicode string as non unicode.

Your non-printable symbol 8206 is transformed into ?, that is how non-unicode strings works: every character that cannot be find in the corresponding codepage and represented as ascii code is substituted with question mark.

So for example if you use Latin Collation and try to compare Hebrew and Cyrillic characters (the same number of characters) as varchar they always be equal just because they are transformed to question marks while comparing as nvarchar they are different:

enter image description here

So if now you'll try to insert these 2 values (Hebrew + Cyrillic) into your Phone table passing them as non-unicode (without using N), only one of them will be inserted, and the other will be rejected by unique constraint. And if you'll try to select from Phone, '?????' will be returned