I have a SQL Server 2019 instance with databases and tables all set to the collation Latin1_General_100_CI_AS_SC_UTF8.
The relevant table has `code` and `desc` columns, both of data type `varchar`.
In my SSIS project I have a single Data Flow task:

- A UTF-8 CSV file is read with a Flat File connection; the text column `code` to match is `DT_STR, 65001`.
- A Lookup set to "Full Cache" loads the Latin1_General_100_CI_AS_SC_UTF8 table, but SSIS thinks the `varchar` columns are `DT_STR, 1252`.
- Finally, the `code` columns from the CSV and the Lookup are matched, and `desc` is sent to a destination table on the same Latin1_General_100_CI_AS_SC_UTF8 collation. The destination component is set to `AlwaysUseDefaultCodePage = True` and `DefaultCodePage = 65001`.
I now get an error saying the column has more than one code page, and the package cannot run.
If not for the mislabeled 1252, this package should run. I believe it's something to do with `ExternalMetadataXml`, which is read-only and says all my Lookup `varchar` columns are `CodePage="1252"`.
If I manually edit the package .dtsx in Notepad++ and replace all instances of 1252 with 65001, the package runs and seems to do what I expected, as long as I never touch the Lookup component again. That seems like a messy solution, though; I'm hoping someone has a cleaner way to fix this. Thanks.
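For reference, the manual find-and-replace can be scripted. This is a minimal sketch (the helper name `patch_codepage` is my own, and it carries the same caveat as the manual hack: the SSIS designer will overwrite the change the next time the Lookup is edited):

```python
# Sketch of the manual .dtsx hack as a script. patch_codepage is a
# hypothetical helper: it blindly rewrites every CodePage="1252"
# attribute in the package XML to 65001. The SSIS designer will undo
# this the next time the Lookup component is touched.
from pathlib import Path

def patch_codepage(dtsx_path, old="1252", new="65001"):
    text = Path(dtsx_path).read_text(encoding="utf-8")
    patched = text.replace(f'CodePage="{old}"', f'CodePage="{new}"')
    Path(dtsx_path).write_text(patched, encoding="utf-8")
    # number of code-page attributes now set to the new value
    return patched.count(f'CodePage="{new}"')
```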
Edit 2021/03/10: I should clarify that the Lookup uses an OLE DB connection to read a table from the SQL server, and tries to match its `code` column against the `code` column read from the flat file. The problem is not the flat file, which is already recognised as `DT_STR, 65001`. The problem is that the Lookup component, reading from a table that should be 65001, somehow ends up with 1252. If I don't use OLE DB, and instead build the Lookup from a UTF-8 flat file, there is no error.
Best Answer
I found these tidbits of information while reading through the collation documentation:
Reference: Collation and Unicode support (Microsoft | SQL Docs)
...So the collation Latin1_General_100_CI_AS_SC does relate to the Windows code page 1252...
Reference: Collation and Unicode support - Supplementary characters (Microsoft | SQL Docs)
... now I know where the _SC comes from and reading on ...
Reference: Collation and Unicode support - UTF-8 support (Microsoft | SQL Docs)
... and now we have a matching collation: LATIN1_GENERAL_100_CI_AS_SC_UTF8.
Taken together, it sums up to this: the Windows code page is 1252, while the SQL Server collation is UTF-8.
Your Fix
Hacking the SSIS package seems to fix the wrong interpretation, but only until the package is edited or modified again.
Could it be that the TXT file isn't being interpreted as 100% UTF-8?
TXT files and UTF-8
I saved the first couple of lines of this answer into a .txt file and had a look at the contents in a hex editor:

[Image: hex dump of the file saved as UTF-8 in Notepad]

There doesn't seem to be a UTF-8 BOM header there.

[Image: hex dump of the file saved as ANSI in Notepad]

So both ANSI and UTF-8 are stored the same way when using Windows NOTEPAD.EXE.
I do know that a UTF-8 file sometimes comes with a BOM (Byte Order Mark), which specifically denotes the file as UTF-8.
It looks like this: the three bytes `EF BB BF` at the very start of the file.
This basically shouts: "In case you don't know, I am a UTF-8 encoded file..."
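Whether a given file actually carries that marker can be checked by reading its first three bytes; a small sketch in Python (`has_utf8_bom` is just an illustrative name):

```python
# Check whether a file starts with the UTF-8 BOM (bytes EF BB BF).
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```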
But ...
Reference: Byte order mark (Wikipedia)
Possible Conclusion
Going back on the quoted information in the introduction...
What is the definition of "full support for the widely used UTF-8 character encoding"? With or without BOM? We don't know, so let's try another approach.
Possible Solution
I suggest you grab a tool that supports saving TXT files as UTF-8 and which will insert the BOM.
This way you ensure that the SSIS package interprets the TXT file as UTF-8 and does not fall back to interpreting it as code page 1252. You shouldn't encounter any further issues with your SSIS package.
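If re-saving every file by hand is impractical, the file can also be rewritten programmatically. A minimal sketch using Python's `utf-8-sig` codec, which prepends the BOM on write (the function name `resave_with_bom` is my own):

```python
from pathlib import Path

# Re-save a UTF-8 text file with a BOM. Reading with "utf-8-sig"
# tolerates a BOM that may already be present; writing with it
# prepends the EF BB BF marker to the output file.
def resave_with_bom(src, dst):
    text = Path(src).read_text(encoding="utf-8-sig")
    Path(dst).write_text(text, encoding="utf-8-sig")
```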