SQL Server – Resolve Encoding Issue with VARCHAR Column Retrieved in Python

Tags: encoding, python, sql-server, sql-server-2008-r2, utf-8

We recently had an issue with encoding related to a field that's being stored as a varchar(120) in SQL Server. In SSMS, the varchar appears as:

"Who Killed JonBen‚t?"

However, when it's brought into Python, the low quotation mark (‚) is displayed as the "unknown character" replacement symbol (�) instead.

I've researched this from the Python side, and nothing strange is going on. My theory is that the varchar in SQL Server is accepting UTF-8 characters which are displaying differently in Python than in SSMS. I'm not very familiar with encoding in SQL Server. Can someone please let me know the following:

1. Is there a way in SSMS to view the encoding of the varchar? For instance, to see \x82 instead of the comma-like character that SSMS currently displays?
2. We're using SQL Server 2008. Is there any way to change the encoding for any UTF-8 characters to ASCII characters without using import/export tools or dumping to a flat file? That is, can I make this conversion via a query?
3. Is there any way to programmatically identify problematic records via a query (problematic being defined as UTF-8 characters that are not supported via ASCII)?

Thank you in advance!

Using sp_help N'table_name'; I found that the Collation of this VARCHAR column is: SQL_Latin1_General_CP1_CI_AS.

Best Answer

SQL Server 2008 does not store UTF-8 under any circumstances (UTF-8 collations were only introduced in SQL Server 2019). You get either UTF-16 Little Endian (LE) via NVARCHAR (including NCHAR and NTEXT, but don't ever use NTEXT) and XML, or some 8-bit encoding, based on a Code Page, via VARCHAR (including CHAR and TEXT, but don't ever use TEXT).

The problem here is that your code is mistranslating that 0x82 byte, thinking that it's UTF-8, but it's not. There is no UTF-8 "character" consisting of the single byte 0x82 (in UTF-8, bytes 0x80 through 0xBF are only valid as continuation bytes inside a multi-byte sequence), which is why you get the "unknown" / replacement symbol of "�". Please see the following UTF-8 table, which shows that there is no character for a single byte of 0x82:
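The same behaviour can be reproduced from Python with nothing but the standard library. This is a minimal sketch: the byte literal below stands in for whatever the driver hands back, which is an assumption about how the value reaches your code.

```python
# Stand-in for the raw VARCHAR bytes; 0x82 sits where SSMS shows the low quote.
raw = b"Who Killed JonBen\x82t?"

# Strict UTF-8 decoding rejects the lone 0x82 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
    bad_offset = None
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start  # index of the offending byte

# Lenient decoding substitutes U+FFFD, the replacement character.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok, bad_offset, lenient)
```

If the client library decodes leniently, the original byte is already lost by the time the string reaches your code, so the fix has to happen at (or before) the decoding step.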

UTF-8 encoding table

As stated by the O.P., the Collation of the column in question is SQL_Latin1_General_CP1_CI_AS, which means that the 8-bit encoding is using Code Page 1252, which is Windows Latin 1 (ANSI). And checking the Code Page 1252 chart (scroll down to the bottom chart as it has the character names), value 0x82 (look for "82" in the "Code Point" column) is in fact the Single Low-9 Quotation Mark (U+201A) that you see in SSMS. That character, in UTF-8, is a 3-byte sequence: E2 80 9A.
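Both facts can be checked without a chart. The sketch below uses Python's built-in cp1252 codec and the unicodedata module; the variable names are only illustrative.

```python
import unicodedata

raw_byte = b"\x82"

# Code Page 1252 maps byte 0x82 to a single Unicode character.
char = raw_byte.decode("cp1252")
code_point = f"U+{ord(char):04X}"
name = unicodedata.name(char)

# The same character, encoded as UTF-8, takes three bytes.
utf8_hex = " ".join(f"{b:02X}" for b in char.encode("utf-8"))

print(code_point, name, utf8_hex)
```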

What all of this means is: your Python code needs to either set the client-encoding for the SQL Server connection to Code Page 1252, or you need to change / convert the encoding of the returned string from Code Page 1252 to UTF-8.
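For the second option, a conversion helper might look like the sketch below. The function name cp1252_to_text is hypothetical, and it assumes the driver returns the column as raw bytes; how to set the client encoding on the connection itself depends on the driver in use and is not shown here.

```python
from typing import Optional, Union


def cp1252_to_text(value: Union[bytes, str, None]) -> Optional[str]:
    """Decode a value read from a Code Page 1252 VARCHAR column.

    Hypothetical helper: assumes undecoded values arrive as bytes.
    """
    if value is None:
        return None  # SQL NULL stays None
    if isinstance(value, bytes):
        # Strict decoding raises on the five byte values that Python's
        # cp1252 codec leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D).
        return value.decode("cp1252")
    return value  # already a str, i.e. the driver decoded it


raw = b"Who Killed JonBen\x82t?"
text = cp1252_to_text(raw)

# Re-encode only at the boundary where UTF-8 bytes are actually required.
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes)
```

Decoding once, as close to the database as possible, keeps the rest of the program working with ordinary Unicode strings.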

Of course, if this is being displayed on a web page, then you could change the declared charset of the page to be Windows-1252, but that might interfere with other characters on the page if there are UTF-8 characters already there.

Related Solutions

Sql-server – Can Unicode columns include non-Unicode values in SQL Server

Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all MUST be UNICODE...

Only the XML and N-prefixed types (NCHAR, NVARCHAR, and NTEXT [which has been deprecated since SQL Server 2005 was released, so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.

... and therefore WILL NOT have any characters unable to be converted from UTF-16LE to UTF-8...

This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.
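To illustrate the point about Surrogate Pairs, here is a small Python sketch. The helper name is_valid_utf16le is hypothetical, and the byte strings are hand-built examples rather than data from SQL Server.

```python
def is_valid_utf16le(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-16LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# U+1F600 as a proper surrogate pair (code units D83D DE00, little endian).
valid_pair = bytes.fromhex("3DD800DE")
# A high surrogate (D83D) followed by "A" instead of a low surrogate.
lone_high = bytes.fromhex("3DD84100")

# A valid character converts cleanly between encodings.
text = valid_pair.decode("utf-16-le")
as_utf8 = text.encode("utf-8")
as_utf32be = text.encode("utf-32-be")

# A lone surrogate is not a valid character in UTF-8 either.
try:
    "\ud83d".encode("utf-8")
    lone_encodable = True
except UnicodeEncodeError:
    lone_encodable = False

print(is_valid_utf16le(valid_pair), is_valid_utf16le(lone_high), lone_encodable)
```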

... simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?

Well, you need to be sure to save the initial export file with a Unicode-encoding. So you would use either the -N or -w options with BCP.

Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UTF-16LE characters to UTF-8.
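One way to do a real conversion, rather than a BOM swap, is to decode the export and re-encode it. The sketch below is an assumption-laden example: the function name and file names are hypothetical, and it treats the input as UTF-16LE with an optional leading BOM.

```python
import codecs
import os
import tempfile


def convert_utf16le_to_utf8(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-16LE text file as UTF-8 (written without a BOM)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-16-le", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        first = True
        for line in src:
            if first:
                # Drop a leading BOM (U+FEFF) if the export included one.
                if line.startswith("\ufeff"):
                    line = line[1:]
                first = False
            dst.write(line)


# Demo with a throwaway file standing in for the exported data.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "export_utf16.txt")
    dst_file = os.path.join(tmp, "export_utf8.txt")
    with open(src_file, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + "JonBen\u201at\r\n".encode("utf-16-le"))
    convert_utf16le_to_utf8(src_file, dst_file)
    with open(dst_file, "rb") as f:
        converted = f.read()

print(converted)
```

Every character is rewritten in the new encoding here, so the low quotation mark that took two bytes in UTF-16LE takes three in the output.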

Sql-server – Encoding Debug UTF8 & Latin 1

The reason your function takes ages is that you have empty values for actual in UTF8Encoding. The patindex expression returns 1 when you check for an empty actual, so you never exit the inner loop. You can fix that by adding "and actual <> ''" to the query against UTF8Encoding. The next issue is where you use @expected as the parameter to nchar(). The parameter should be an integer, so if you remove nchar() your code returns something, but I don't think it is what you are looking for: WilcoxonÃƒÆ’Ã†â€™ is translated to WilcoxonÁƒƒÁ††™.

Another approach you can try is to use the XML capabilities in SQL Server. XML in SQL Server is UTF-16 but it is able to load UTF-8 encoded strings and that can be used.

Concatenate your string with a UTF-8 xml declaration and use the value() function to fetch the value from the constructed XML.

I guess you eventually want to use this on a table so here is an example that uses a table variable.

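-- Sample data: UTF-8 bytes that were stored in a VARCHAR column and therefore display as mojibake.
declare @T table(InputString varchar(max));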

insert into @T values
('Ã¥Ã¤Ã¶Ã…Ã„Ã–'),
('WilcoxonÃƒÆ’Ã†â€™')

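-- Prepending a UTF-8 XML declaration makes SQL Server parse the VARCHAR bytes as UTF-8;
-- value() then returns the decoded text as NVARCHAR (UTF-16).
select cast('<?xml version="1.0" encoding="UTF-8"?>'+T.InputString as xml).value('text()[1]', 'nvarchar(max)') as Value
from @T as T;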

Result:

Value
---------------
åäöÅÄÖ
WilcoxonÃƒÆ’
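For comparison, the same single decoding pass can be sketched on the Python side. This is not the XML technique above but a client-side analogue, assuming the garbling came from UTF-8 bytes being read as Code Page 1252; the function name undo_one_mojibake_pass is hypothetical.

```python
def undo_one_mojibake_pass(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes read as Code Page 1252'.

    Raises UnicodeError if the text does not fit that pattern.
    """
    return text.encode("cp1252").decode("utf-8")


# Build the garbled form of the first sample row, then repair it.
original = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"  # the six Swedish letters
garbled = original.encode("utf-8").decode("cp1252")
repaired = undo_one_mojibake_pass(garbled)

# Second sample row, written with escapes to avoid copy/paste damage.
wilcoxon_in = "Wilcoxon\u00c3\u0192\u00c6\u2019\u00c3\u2020\u00e2\u20ac\u2122"
wilcoxon_out = undo_one_mojibake_pass(wilcoxon_in)

print(repaired)
print(wilcoxon_out)
```

As in the SQL result, one pass is not enough for the second row: the value was garbled more than once, so it still contains mojibake after a single repair.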
