Sql-server – How to replace multiple unicode characters

sql serversql-server-2012t-sqlunicode

I have identified a couple of unicode characters in my database by means of the following query:

SELECT name FROM table WHERE name NOT LIKE '%'+CAST(name AS VARCHAR(1000))+'%'

…with which I get stuff like this:

MORRISON CPA LIMITED  E¦¬Ç¦®EŒÆŒƑȨˆÅ¸«Äº‹Å‹™Æ‰€ÆŒ‰E™Å…¬Å¸

Is it possible to get the collation when using a REPLACE function? Something like:

UPDATE table SET name = REPLACE(name,NOT LIKE '%'+CAST(name AS VARCHAR(1000))+'%','')

I know I can individually replace them REPLACE(name,'Æ','') but is there a way to replace everything that is unicode?

Edit: I don't know why but I get less rows if I use

SELECT name FROM table WHERE name != CAST(name AS VARCHAR(1000))

Best Answer

NVARCHAR - VARCHAR conversions

This topic has been discussed heavily in a previous question on DBA.SE

Some parts of the answers on that question:

which characters can be stored in an 8-bit / non-Unicode encoding depends on the code page, which is determined by the Collation.

&

VARCHAR can no longer be referred to as "non-Unicode". So, starting with the first public beta of SQL Server 2019 in September 2018, we should refer to VARCHAR as an "8-bit datatype", even when speaking in terms of versions prior to SQL Server 2019. This terminology holds true for all 4 types of encodings that can be used with VARCHAR:

Extended ASCII Double-Byte Character Sets (DBCS) EBCDIC UTF-8 (Unicode)

Solomon Rutzky

From the comments, I agree "Extended ASCII" is really bad term that actually means a code page that maps characters/code points in the 128-255 range, beyond the standard 0-127 code point range defined by ASCII.

Dan Guzman

Depending on your collation, some characters will be able to be stored in varchar / 8-bit datatype fields, some not.


My values do not transform to '?' when casting to varchar, your fiddle uses sql 2017 could that be the reason?

This can depend on collation, and should be due to converting of certain characters

I am using SQL Server 2017, collation Latin1_General_CI_AS

On my test, the Ƒ character was changed to ƒ instead of put as ? when converting nvarchar --> varchar.

Other than that, there are no different characters when comparing both, on my collation / version.

An example:

SELECT CAST([name] AS VARCHAR(1000)), [name] 
FROM dbo.Bla 
WHERE [name]
NOT LIKE '%'+CAST([name] AS VARCHAR(1000))+'%';

With the small difference on the F's.

(No column name)    name
ORRISON CPA LIMITED  E¦¬Ç¦®EŒÆŒƒÈ¨ˆÅ¸«Äº‹Å‹™Æ‰€ÆŒ‰E™Å…¬Å¸   ORRISON CPA LIMITED  E¦¬Ç¦®EŒÆŒƑȨˆÅ¸«Äº‹Å‹™Æ‰€ÆŒ‰E™Å…¬Å¸

You could compare the text here

Which makes the solution below not useful for you.


Some workarounds on values that fail to convert would be to

  1. Cast the values as varchar(1000), and then remove the ? after validating that no ?'s exist.
  2. Only update name values that do not have regular ?'s.
  3. Replacing the ? to a value that cannot exist in your data and then replacing it back.

On the premise that the unicode is changed to a ? when casting them to varchar

#1 Cast the values as varchar(1000), and then remove the ?.

SELECT [name] FROM [dbo].[Table]    WHERE [name] like '%?%';

Update if none found

 UPDATE [dbo].[Table] SET [name] = REPLACE(CAST([name] AS VARCHAR(1000)),'?','');

#2 Only update name values that do not have regular ?'s.

UPDATE [dbo].[Table]  
SET [name] = REPLACE(CAST([name] AS VARCHAR(1000)),'?','')
WHERE [name] NOT LIKE '%?%';

DB<>Fiddle


#3 Replacing the ? to a value that cannot exist in your data and then replacing it back.

With you guessed it, another REPLACE.

UPDATE [dbo].[Table] 
SET [name] = REPLACE(REPLACE(CAST(REPLACE([name],'?','THISVALUECANNOTOCCURINMYDATA') AS VARCHAR(1000)),'?',''),'THISVALUECANNOTOCCURINMYDATA','?')

ReplaceCeption.

DB<>Fiddle


If there are no other, more foolproof, get all unicode values and replace these workarounds. I would use #3 just to be sure that no original ?s are replaced.