SQL Server Replace Function – Removing Ones When Removing char(185)

collation, sql server, sql-server-2008-r2, temporary-tables, unicode

OK, as the title suggests, this is occurring.

When using the REPLACE function on char(185) for data in a temporary table, ones ("1") are being removed as well.

SQL SERVER 2008R2
UNICODE Data

DB collation – collate SQL_Latin1_General_CP1_CS_AS
Tempdb has default

Any other details I could provide which might be helpful, let me know.

Best Answer

CHAR(185), being VARCHAR data and using an 8-bit code page, can be a different character depending on the code page being used (which is determined by the collation of the current database when using the CHAR() function). In many code pages, including the common Windows-1252 (Latin1_General), the character is a superscript 1 (i.e. "¹" ). This character, depending on the collation / code page being used, might have a "best fit" mapping to a regular "1".

Also, when using a Windows collation, case-sensitivity can affect the outcome of matches between subscript, superscript, and regular versions of the same characters (seriously! see example below). This also covers NVARCHAR / Unicode data (including UTF-8 in VARCHAR starting in SQL Server 2019).

When needing to match on specific characters you need to use a binary collation (i.e. one ending in _BIN2 unless on SQL Server 2005 or older, in which case only the _BIN collations are available).

SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE SQL_Latin1_General_CP1_CI_AS
-- no rows (SQL Server collation, code page 1252, case insensitive)



SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE SQL_Latin1_General_CP437_CI_AS
-- 1 row (SQL Server collation, code page 437, case insensitive)

SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE SQL_Latin1_General_CP437_CS_AS
-- 1 row (SQL Server collation, code page 437, case sensitive)



SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE Latin1_General_100_CI_AS
-- 1 row (Windows collation, code page 1252, case insensitive)

SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE Latin1_General_100_CS_AS
-- no rows (Windows collation, code page 1252, case sensitive)



SELECT CHAR(185)
WHERE  CHAR(185) = '1' COLLATE Latin1_General_100_BIN2;
-- no rows (Windows collation, code page 1252, _BIN2 collation; always works!)

As noted above, Unicode data (regardless of type of collation) works the same as VARCHAR data used with a Windows collation (because both scenarios use the same Unicode rules):

SELECT NCHAR(185)
WHERE  NCHAR(185) = N'1' COLLATE Latin1_General_100_CI_AS
-- 1 row (Windows collation, Unicode data, case insensitive)

SELECT NCHAR(185)
WHERE  NCHAR(185) = N'1' COLLATE Latin1_General_100_CS_AS
-- no rows (Windows collation, Unicode data, case sensitive)
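Applied to the REPLACE call from the question: REPLACE compares using the collation of its input, so forcing a binary collation on the input string should make it match only the exact character. The following is a minimal sketch, not the asker's actual code. The variable @val and its sample value are made up for illustration, and NVARCHAR is assumed because the question mentions Unicode data.

```sql
-- Hypothetical sample value: regular 1s plus one superscript 1 (code point 185).
DECLARE @val NVARCHAR(50) = N'A1' + NCHAR(185) + N'B1';

SELECT
    -- Uses the current database's default collation; under a collation that
    -- treats the superscript 1 as equal to 1, the regular 1s are removed too.
    REPLACE(@val, NCHAR(185), N'') AS DefaultCollation,
    -- A binary collation compares code points, so only NCHAR(185) is removed.
    REPLACE(@val COLLATE Latin1_General_100_BIN2, NCHAR(185), N'') AS BinaryCollation;
-- BinaryCollation: A1B1
```

For a temporary-table column, the same COLLATE clause can be applied to the column reference inside REPLACE.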

Related Solutions

SQL Server to MySQL Migration – Remove UCS-2 Surrogate Pairs

You need to take the data from UTF-8 and convert it into UCS-2LE using something like iconv. For example, using the character in your example:

echo "010000: dcb3" | xxd -r -s -0x10000 | iconv -f "UTF-8" -t "UCS-2LE" | xxd
0000000: 3307

Now I'm not sure what character UTF-8 \xdcb3 is, but apparently its correct translation to UCS-2LE is U+0733. If you have 0xDCB3 in SQL Server, it means it was not translated into UCS-2LE before import. You should not have surrogates in the NVARCHAR fields; UCS-2 is "surrogate agnostic". See UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer).

I'm not an expert in the MySQL tool set so I can't say what step is missing that was supposed to do the iconv.

Update

To locate the records with surrogates you must turn to the binary representation, since any character function will treat the surrogates as 'special'. Luckily, the string manipulation functions work on binary too, with the expected semantics. E.g. CHARINDEX:

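-- Assumed table definition; the original answer does not show it.
create table test(a nvarchar(50));
insert into test(a) values  (N'a');
insert into test(a) values  (NCHAR(0xdc83));
insert into test(a) values  (N'b');
go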

select * from test where charindex(0x83dc, cast(a as varbinary(8000))) > 0;

Sql-server – Function is counting -1 char ( wrong )

The problem with your code is probably that the logic looks for the @SplitOn character to know where the end of the SUBSTRING should be. The fix would be to simply add the @SplitOn character to the end of @List at the start of the function.

HOWEVER, I think you are going about the operation in the wrong way. First, splitting using a WHILE loop is horribly inefficient, especially since you have to recalculate @List in each iteration of the loop.

You would be far better off using an efficient string splitter and then doing a COUNT(*) combined with a WHERE val = '5'.

For example:

SELECT COUNT(*)
FROM   SQL#.String_Split4k(N'5,5,5,5,55,5,5,5,5', N',', 1) split
WHERE  split.SplitVal = N'5';
-- 8

Please note that the example uses a SQLCLR function from the SQL# library (which I am the author of, but String_Split4k is in the Free version) to do the string splitting.
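If installing a SQLCLR library is not an option, the same count can be sketched with the built-in STRING_SPLIT function. This is an assumption about the environment rather than part of the original answer: STRING_SPLIT requires SQL Server 2016 or newer with database compatibility level 130 or higher.

```sql
-- Built-in splitter; returns one row per element in a column named "value".
SELECT COUNT(*)
FROM   STRING_SPLIT(N'5,5,5,5,55,5,5,5,5', N',') AS split
WHERE  split.value = N'5';
-- 8
```

As with the SQL# version, the element 55 is not counted because the comparison is against the whole element, not a substring.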
