SQL Server – How to fix the code page in an SSIS Lookup Transformation to be 65001

collation | sql-server | sql-server-2019 | ssis | unicode

I have a SQL Server 2019 instance with databases and tables all set to the collation Latin1_General_100_CI_AS_SC_UTF8.

The relevant table has code and desc columns which are both of data type varchar.

In SSIS project I have a single Data Flow component:

I have a UTF-8 CSV file read with a Flat File connection; the text column code used for matching is DT_STR, 65001.

I have a Lookup set to "Full Cache" that loads the Latin1_General_100_CI_AS_SC_UTF8 table, but SSIS thinks the varchar columns are DT_STR, 1252.

Finally, the code values from the CSV and the lookup are matched, and desc is sent to the destination table, which uses the same Latin1_General_100_CI_AS_SC_UTF8 collation. The destination component is set to AlwaysUseDefaultCodePage = True and DefaultCodePage = 65001.

I now get an error saying the column has more than one code page, and the package cannot run.

If not for the mislabeled 1252, this package should run. I believe it's something to do with ExternalMetadataXml, which is read-only and reports all my lookup varchar columns as CodePage="1252".

If I manually edit the package .dtsx with Notepad++ and replace all instances of 1252 with 65001, the package runs and seems to do what I expected, as long as I never touch the Lookup component again. That seems like a pretty hacky solution, though, so I'm hoping someone has a cleaner way to fix this. Thanks.
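For what it's worth, the hand edit amounts to something like this small Python sketch (the file path is an example, and it blindly assumes every CodePage="1252" in the package should become 65001, which may be too broad for packages that legitimately reference 1252 elsewhere):

```python
# Blunt sketch of the manual .dtsx edit described above: the package file is
# plain XML, so rewrite the CodePage attributes in place. Assumes every
# CodePage="1252" attribute should become 65001 - review before running on
# a real package.
from pathlib import Path

def patch_code_page(dtsx_path: str) -> int:
    path = Path(dtsx_path)
    xml = path.read_text(encoding="utf-8")
    count = xml.count('CodePage="1252"')
    path.write_text(xml.replace('CodePage="1252"', 'CodePage="65001"'),
                    encoding="utf-8")
    return count  # how many attributes were rewritten
```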

edit 2021/03/10: I thought I should clarify: the Lookup uses OLE DB to read a table from the SQL Server, and tries to match its code column against the code column read from the flat file. The problem is not the flat file, which is already recognised as DT_STR, 65001. The problem is that the Lookup, reading from a table that should be 65001, somehow ends up with 1252. If I don't use OLE DB and instead create the Lookup from a UTF-8 flat file, there is no error.

Best Answer

I found these tidbits of information while reading through the collation documentation:

For example, for the OS locale "English (United States)" (code page 1252), the default collation during setup is SQL_Latin1_General_CP1_CI_AS, and it can be changed to its closest Windows collation counterpart, Latin1_General_100_CI_AS_SC.

Reference: Collation and Unicode support (Microsoft | SQL Docs)

...So the collation Latin1_General_100_CI_AS_SC does relate to the Windows code page 1252...

SQL Server 2012 (11.x) introduced a new family of supplementary character (_SC) collations that can be used with the nchar, nvarchar, and sql_variant data types to represent the full Unicode character range (000000–10FFFF). For example: Latin1_General_100_CI_AS_SC or, if you're using a Japanese collation, Japanese_Bushu_Kakusu_100_CI_AS_SC.

Reference: Collation and Unicode support - Supplementary characters (Microsoft | SQL Docs)

... now I know where the _SC comes from and reading on ...

SQL Server 2019 (15.x) introduces full support for the widely used UTF-8 character encoding as an import or export encoding, and as database-level or column-level collation for string data. UTF-8 is allowed in the char and varchar data types, and it's enabled when you create or change an object's collation to a collation that has a UTF8 suffix. One example is changing LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.

Reference: Collation and Unicode support - UTF-8 support (Microsoft | SQL Docs)

... and now we have a matching collation: LATIN1_GENERAL_100_CI_AS_SC_UTF8.

So it all comes down to this: the Windows code page is 1252, while the SQL Server collation is UTF-8.
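A quick Python check (my own illustration, not from the docs) makes the difference concrete: the two encodings agree on ASCII but diverge for anything beyond it, which is exactly why the 1252/65001 mismatch can go unnoticed until non-ASCII data shows up:

```python
# Code page 1252 and UTF-8 agree on the first 128 code points (ASCII),
# so plain-English data round-trips either way; anything beyond diverges.
s = "café"
cp1252_bytes = s.encode("cp1252")  # é is the single byte 0xE9
utf8_bytes = s.encode("utf-8")     # é is the two bytes 0xC3 0xA9
assert cp1252_bytes[:3] == utf8_bytes[:3]  # "caf" encodes identically
assert cp1252_bytes != utf8_bytes          # the é bytes differ
```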

Your Fix

Hacking the SSIS package seems to fix the wrong code-page metadata, but only until the package is edited or modified again.

Could it be that the TXT file isn't being interpreted as 100% UTF-8?

TXT files and UTF-8

I saved the first couple of lines of this answer into a .txt file and had a look at the contents in a Hex-Editor:

Hex of Notepad UTF-8

00000000h: 49 20 66 6F 75 6E 64 20 74 68 65 73 65 20 74 69 ; I found these ti
00000010h: 64 62 69 74 73 20 6F 66 20 69 6E 66 6F 72 6D 61 ; dbits of informa
00000020h: 74 69 6F 6E 20 77 68 69 6C 65 20 72 65 61 64 69 ; tion while readi

There doesn't seem to be a UTF-8 BOM header there.

Hex of Notepad ANSI

00000000h: 49 20 66 6F 75 6E 64 20 74 68 65 73 65 20 74 69 ; I found these ti
00000010h: 64 62 69 74 73 20 6F 66 20 69 6E 66 6F 72 6D 61 ; dbits of informa
00000020h: 74 69 6F 6E 20 77 68 69 6C 65 20 72 65 61 64 69 ; tion while readi

So for this plain-ASCII text, both ANSI and UTF-8 (without a BOM) are stored identically by Windows NOTEPAD.EXE.

I do know that a UTF-8 file sometimes comes with a BOM (Byte Order Mark), that specifically denotes the file as UTF-8.

This looks like this:

00000000h: EF BB BF 49 20 66 6F 75 6E 64 20 74 68 65 73 65 ; I found these
00000010h: 20 74 69 64 62 69 74 73 20 6F 66 20 69 6E 66 6F ;  tidbits of info
00000020h: 72 6D 61 74 69 6F 6E 20 77 68 69 6C 65 20 72 65 ; rmation while re
00000030h: 61 64 69                                        ; adi

This basically shouts: In case you don't know, I am a UTF-8 encoded file...
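Checking for that three-byte signature yourself is trivial; here is a small Python sketch (the file handling is my own illustration):

```python
# Detect the UTF-8 BOM (0xEF 0xBB 0xBF) shown in the hex dump above by
# reading just the first three bytes of the file.
def has_utf8_bom(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```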

But ...

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8. [...] Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

Reference: Byte order mark (Wikipedia)

Possible Conclusion

Going back on the quoted information in the introduction...

...SQL Server 2019 (15.x) introduces full support for the widely used UTF-8 character encoding as an import or export encoding, and as database-level or column-level collation for string data. UTF-8 is allowed ....

What is the definition of full support for the widely used UTF-8 character encoding? With or without BOM? We don't know, so let's try another approach.

Possible Solution

I suggest you grab a tool that supports saving TXT files as UTF-8 with a BOM inserted.

This way you ensure that the SSIS package interprets the TXT file correctly as UTF-8, rather than converting the metadata to code page 1252, and you shouldn't encounter any further issues with your SSIS package.
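If scripting the conversion is an option, Python's utf-8-sig codec does exactly this; a sketch (the file names and the re-save workflow are my own example):

```python
# Re-save a text file as UTF-8 with a BOM. Reading with "utf-8-sig" strips
# an existing BOM if one is present, and writing with it prepends one, so
# the conversion is safe to run on files that already carry the marker.
def save_with_bom(src: str, dst: str) -> None:
    with open(src, encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
```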