SQL Server – Resolve Encoding Issue with VARCHAR Column Retrieved in Python

encodingpythonsql serversql-server-2008-r2utf-8

We recently had an issue with encoding related to a field that's being stored as a varchar(120) in SQL Server. In SSMS, the varchar appears as:

"Who Killed JonBen‚t?"

However, when it's brought into python, it appears as:

enter image description here

I've researched this from the Python side, and nothing strange is going on. My theory is that the varchar in SQL Server is accepting UTF-8 characters which are displaying differently in python than SSMS. I'm not very familiar with encoding in SQL Server. Can someone please let me know the following:

  • Is there a way in SSMS to view the encoding of the varchar? For instance see \x82 instead of displaying the comma as it is currently from SSMS?
  • We're using SQL Server 2008. Is there any way to change the encoding for any UTF-8 characters to ASCII characters without using import /export tools or dumping to a flat file? I.e. can I make this conversion via a query?
  • Is there any way to programmatically identify problematic records via a query (problematic being defined as UTF-8 characters that are not supported via ASCII)?

Thank you in advance!

Using sp_help N'table_name'; I found that the Collation of this VARCHAR column is: SQL_Latin1_General_CP1_CI_AS.

Best Answer

SQL Server does not store UTF-8 under any circumstances. You get either UTF-16 Little Endian (LE) via NVARCHAR (including NCHAR and NTEXT, but don't ever use NTEXT) and XML, or some 8-bit encoding, based on a Code Page, via VARCHAR (including CHAR and TEXT, but don't ever use TEXT).

The problem here is that your code is mistranslating that 0x82 character, thinking that it's UTF-8, but it's not. There is no UTF-8 "character" having a value of 0x82, which is why you get the "unknown" / replacement symbol of "�". Please see the following UTF-8 table which shows that there is no character for a single-byte of 0x82:

UTF-8 encoding table

As stated by the O.P., the Collation of the column in question is SQL_Latin1_General_CP1_CI_AS, which means that the 8-bit encoding is using Code Page 1252, which is Windows Latin 1 (ANSI). And checking that chart (scroll down to the bottom chart as it has the character names) value 0x82 (look for "82" in the "Code Point" column) is in fact the Single Low-9 Quotation Mark that you see in SSMS. That character, in UTF-8, is a 3 byte sequence: E2 80 9A.

What all of this means is: your Python code needs to either set the client-encoding for the SQL Server connection to Code Page 1252, or you need to change / convert the encoding of the returned string from Code Page 1252 to UTF-8.

Of course, if this is being displayed on a web page, then you could change the declared charset of the page to be Windows-1252, but that might interfere with other characters on the page if there are UTF-8 characters already there.