Sql-server – How to insert a Unicode character verbose into a varchar DB

collationencodingsql servert-sqlunicode

I need to insert this character '●' into a VARCHAR column of a MSSQL database with collation set as SQL_Latin1_General_CP1_CI_AS (or at least mock what my Python + Windows MSSQL Driver might have done). I tried to insert this from my client, but it gets inserted as ?.

The hex value as checked in Python is \xe2\x97\x8f which is binary 226, 151, 143.

The collation spec says that 226 is defined but 143 and 151 is undefined. So my best would be to insert 226?

The reason I am doing this exercise is that our app was inserting this character into DB earlier. At this point, I don't know how it went in because but it went through a Python App with Windows MSSQL Driver and our DB Team wrote a script to correct it and apparently it got replaced with "?".

Now I am just trying to reproduce this scenario to ask them why it is getting replaced with "?" because we are replacing it as empty in the App. Since my App already has a fix for this issue (we remove anything greater than 127), I am trying to mock this directly from back-end to prove that it is getting replaced with '?' which is actually my claim and I think it is done because it says "COLLATE" in the update query.

UPDATE pr_response
SET nur_respon =
REPLACE (nur_respon,
     SUBSTRING(nur_respon, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, nur_respon), 1),
         '')
WHERE PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, nur_respon) > 0

Is this a fine script to remove Non-ASCII-Data > 127? Can someone explain this query in plain English?

Best Answer

I need to insert this character '●' into a VARCHAR column ... with collation set as SQL_Latin1_General_CP1_CI_AS ... The reason I am doing this exercise is that our app was inserting this character into DB earlier.

No, the app was not inserting this character (Black Circle U+25CF) into a VARCHAR column using a [SQL_]Latin1_General... Collation. Latin1_General Collations use Code Page 1252 which does not have such a character, nor does there appear to be a "best fit" mapping to convert it to something similar.

That character does exist in some code pages, such as those used by the Korean and Japanese Collations, as the following queries show (both are double-byte character sets, which is why the conversion to VARBINARY shows two bytes instead of one):

SELECT CONVERT(VARCHAR(10), N'●' COLLATE Korean_100_CI_AS),
       CONVERT(VARBINARY(10), CONVERT(VARCHAR(10), N'●' COLLATE Korean_100_CI_AS));
-- ●    0xA1DC

SELECT CONVERT(VARCHAR(10), N'●' COLLATE Japanese_XJIS_100_CI_AS),
       CONVERT(VARBINARY(10), CONVERT(VARCHAR(10), N'●' COLLATE Japanese_XJIS_100_CI_AS));
-- ●    0x819C

SELECT CONVERT(VARCHAR(10), N'●' COLLATE Latin1_General_100_CI_AS),
       CONVERT(VARBINARY(10), CONVERT(VARCHAR(10), N'●' COLLATE Latin1_General_100_CI_AS));
-- ?    0x3F

It is possible, however, that the app was inserting '•' (Bullet U+2022), which looks quite similar, just a little smaller. The "Bullet" character is available in Code Page 1252 (Dec 149 or Hex 0x95).

our DB Team wrote a script to correct it and apparently it got replaced with "?"

Well, replacing "●" with "?" doesn't sound like fixing it ;-).

Now I am just trying to reproduce this scenario to ask them why it is getting replaced with "?" because we are replacing it as empty in the App. Since my App already has a fix for this issue (we remove anything greater than 127)

Regardless of technicalities, it seems fairly obvious both statements cannot be true at the same time: it cannot get replaced with "?" and get removed before hitting SQL Server. Clearly the app code does not have a fix for this, and is not removing everything with a value over 127. It is being converted into "?" in SQL Server because that character is being inserted yet does not exist in Code Page 1252.

I think it is done because it says "COLLATE" in the update query.

No, the forcing of a binary Collation via the COLLATE keyword is not changing this character into "?". Both the column's Collation ( SQL_Latin1_General_CP1_CI_AS ) and the explicit Collation ( Latin1_General_BIN ) use Code Page 1252, so no characters could change due to this.

The "●" character is being replaced with "?" as it is being inserted into the VARCHAR column. And once it is inserted as a question mark, there is nothing to clean up, and no way to identify that question mark as being originally from something else as opposed to an intended question mark.

Is this a fine script to remove Non-ASCII-Data > 127? Can someone explain this query in plain English?

That query finds any occurrences of characters having an ASCII value of greater than 127 (that is what the PATINDEX does), and then gets any such character (that is what the SUBSTRING does), and then replaces all occurrences of that character in the column with empty string (this is what the REPLACE does). The query does not do any UPDATE if no characters are found having an ASCII value over 127.

This query only works on one character at a time. So, if a column has 2 or more characters with ASCII values above 127 that are not the same ASCII value, then the script would need to be executed multiple times.

That query will not help in this particular case (i.e. of replacing "●" with "?") since that conversion is happening on the way in. This query only handles characters with ASCII values of 128 - 255, yet "●" is not one of those characters as it could never be in this column in the first place.