Why is a UTF-8 document raising UTF-8 encoding errors in Azure Data Lake Analytics?

azure, encoding, hadoop, import, unicode

I have a document that was compressed with gzip by an unknown source system. It was downloaded and decompressed using the 7-Zip console application. The document is a CSV file that appears to be encoded in UTF-8.

It's then uploaded to Azure Data Lake Store right after decompression. Then there is a U-SQL job set up to simply copy it from one folder to another folder. This process fails and raises a UTF-8 encoding error for a value: ée

Testing

I downloaded the document from the store and removed all records except the one with the value flagged by Azure. In Notepad++, it shows the document as UTF-8. I save the document as UTF-8 again and upload it back to the store. I run the process again and it succeeds with that value as UTF-8.

What am I missing here? Is it possible the original document is not truly UTF-8? Is there something else causing a false positive? I'm a bit baffled.

Possibilities

  • The document is not truly UTF-8 and needs to be recoded
  • Maybe the method that's uploading the file is recoding it (see the hash-comparison sketch after this list)
  • Maybe 7zip is recoding it incorrectly
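
A quick way to test the second possibility is to hash the raw bytes of the local file and of the same file downloaded back from the store; if the upload path recoded it, the hashes will differ. A minimal sketch (the file paths are placeholders):

```python
import hashlib

def md5_of(path):
    """Hash the raw bytes so any re-encoding shows up as a mismatch."""
    h = hashlib.md5()
    with open(path, 'rb') as f:  # binary mode: no decoding involved
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

# Placeholder paths: the local file and the same file downloaded back from the store.
print(md5_of('original_local.csv'))
print(md5_of('downloaded_from_store.csv'))
```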

Environment/Tools

  • Windows Server
  • Python 2.7
  • Azure Data Lake Store
  • Azure Data Lake Analytics
  • 7Zip.exe
  • gz
  • Azure API

U-SQL

Just the base U-SQL job that defines the schema and then selects all fields into a new directory. No transformation happens other than leaving out the headers. The file is CSV, comma-delimited with double quotes around strings. The schema is all strings regardless of data type. The extractors tried are TEXT and CSV, with both explicitly set to UTF-8 encoding, even though both default to UTF-8 according to the Azure documentation.

Other Notes

  1. This same document was uploaded in the past to Azure Blob Storage and imported in the same fashion into Azure Data Warehouse, without errors, via PolyBase.
  2. The value that causes the UTF-8 encoding error is a URL mangled among 1 million other records.
  3. It looks like there are ASCII characters coming in even though it's a UTF-8 document.
  4. When I convert it to ANSI and use the ASCII extractor the file succeeds.
  5. Azure Data Lake Analytics does not allow you to ignore the error since it's an encoding issue. I'd be happy invalidating the record altogether, like you can in Azure Data Warehouse.

Best Answer

Notes

There are several things to go over here:

  1. Just to get this out of the way: gz[ip] and 7zip.exe have nothing to do with any of this. Compression does not alter the encoding or any of the original bytes (otherwise compression / decompression wouldn't be reliable). Theoretically there could be a bug in one of them such that the decompressed output is slightly different, but that would be a widespread problem, and these algorithms, and these two tools in particular, have been around for a long time and are known to be reliable.
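
    A quick sanity check (a Python 3 sketch, even though the environment above lists Python 2.7) is to round-trip some non-ASCII bytes through gzip and confirm that nothing changed:

    ```python
    import gzip

    original = 'résumé, naïve, ée'.encode('utf-8')   # arbitrary UTF-8 bytes
    roundtrip = gzip.decompress(gzip.compress(original))

    # Compression is lossless: the decompressed bytes are identical to the input.
    assert roundtrip == original
    ```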

  2. Keep in mind: files, even text files, contain bytes, not characters. Those bytes can represent one set of characters, or another set, or something else entirely. But in the end, it is just a collection of bytes. So if you see Ã, that is not because there is an Ã in the file; it is because one or more bytes are currently being interpreted as representing an Ã. It could be that those same bytes should truly be interpreted as representing something else. And, at the same time, it could also be that some other sequence of bytes, under a different interpretation, also represents Ã.
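
    For example, the exact same two bytes decode to different characters depending on which encoding the reader is told to use (Python 3 sketch):

    ```python
    data = b'\xc3\xa9'                 # two bytes on disk

    print(data.decode('utf-8'))        # é   (one 2-byte UTF-8 sequence)
    print(data.decode('cp1252'))       # Ã©  (two 1-byte Windows-1252 characters)
    print(data.decode('iso-8859-1'))   # Ã©  (same idea in ISO-8859-1)
    ```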

  3. Then there is a U-SQL job set up to simply copy it from one folder to another folder.

    Ok, this right here should raise a red flag: why would "simply copying a file" produce an encoding error? There can only be an encoding error if the file is being read and decoded; otherwise it's just a collection of bytes moving from point A to point B (or perhaps it's merely being relinked and not moved, but either way).
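
    To illustrate: a byte-for-byte copy never decodes anything, so it cannot hit an encoding error; only a copy that reads the text (and therefore decodes it) can. A Python 3 sketch with placeholder paths:

    ```python
    import shutil

    # Byte-for-byte copy: no decoding happens, so no encoding error is possible.
    shutil.copyfile('input.csv', 'output.csv')

    # A "copy" that decodes the text can fail on bytes that are not valid UTF-8.
    with open('input.csv', encoding='utf-8') as src, \
         open('output2.csv', 'w', encoding='utf-8') as dst:
        dst.write(src.read())          # may raise UnicodeDecodeError here
    ```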

  4. This process fails and raises a UTF-8 encoding error for a value: ée

    Here is another key indicator (that is likely being misinterpreted by some): the error is a UTF-8 encoding error. This tells us two things:

    1. The file is being interpreted as UTF-8 already, so the byte sequences are not being interpreted as some other encoding (e.g. Windows-1252, or ISO-8859-1). This means that the é characters are already UTF-8 encoded bytes in the file (i.e. 0xC3 0x83 0xC2 0xA9), and not Windows-1252 bytes (0xC3 0xA9) that should instead be interpreted as UTF-8 to produce é.
    2. The fact that it's an error means that ée is not the error, because if it were the error, then it wouldn't be able to display it. An encoding error occurs when a byte sequence can't be decoded into a character in that encoding. Meaning: there is something missing from the ée sequence. The ée sequence should be the remaining parts of what could be decoded from the bytes in the file. Keep in mind that Unicode is a specification, not software; the implementation is left up to each vendor. There are guidelines regarding compliant implementations, and those include how to handle decoding errors, but there is some flexibility as well. So some implementations might throw a hard error in this case, others might display the "replacement" character, �, and some might display nothing in that position. In any case, the ée sequence, taken literally, is most likely a red herring and not what we should be looking at.
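
    Here is a sketch (Python 3, with made-up bytes) showing how a decoder reports the part it could decode and flags the part it could not; the trailing 0xE9 byte is not valid UTF-8 on its own:

    ```python
    # 'Ã©' encoded as UTF-8 (0xC3 0x83 0xC2 0xA9), then a stray 0xE9 byte that
    # is NOT a valid UTF-8 sequence by itself, then a plain 'e'.
    data = b'\xc3\x83\xc2\xa9\xe9e'

    print(data.decode('utf-8', errors='replace'))   # Ã©�e  -> � marks the bad byte
    data.decode('utf-8')                            # raises UnicodeDecodeError
    ```

    Depending on the implementation, that bad byte may be reported as a hard error, shown as �, or silently dropped, which would leave exactly ée on screen.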
  5. In Notepad++, it shows the document as UTF-8. I save the document as UTF-8 again and upload it back to the store. I run the process again and the process succeeds with that value as UTF-8

    Ok. So, in Notepad++, when it shows "UTF-8" on the right side of the bottom bar upon opening a file, that is not necessarily a guarantee that the file is, in fact, encoded as UTF-8. That is a best guess, probably based on common byte sequences. If the encoding indicator instead showed "UTF-8-BOM", then that would be a guarantee that the file was encoded as UTF-8. The "-BOM" indicates that a Byte Order Mark is present. A BOM is an optional 2- to 4-byte sequence at the beginning of a file that should be hidden but indicates the encoding of the file; BOMs exist only for the Unicode encodings. There are two 2-byte BOMs for UTF-16, one for Little Endian and one for Big Endian. There are also two 4-byte BOMs for UTF-32, one for each Endianness. UTF-8 has only a single 3-byte BOM since Endianness does not apply to UTF-8 (the code units are a single byte, so there is only one order that the byte can be in).

    Of course, having a BOM does not guarantee that it will be respected or even understood by whatever is reading the file. In the case presented here, having a BOM should not have any effect on the outcome since the file is already being read as a UTF-8 file. And again, if Notepad++ is showing ée and shows "UTF-8" in the lower-right corner, then it's already UTF-8, not Windows-1252 / ISO-8859-1 that needs to be told to be read as UTF-8 in order to produce é.
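
    If you would rather check for a BOM yourself than trust an editor's guess, the first few raw bytes of the file tell you (placeholder path):

    ```python
    import codecs

    with open('input.csv', 'rb') as f:     # read raw bytes, no decoding
        head = f.read(4)

    if head.startswith(codecs.BOM_UTF8):                            # EF BB BF
        print('UTF-8 BOM present')
    elif head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        print('UTF-32 BOM present')
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        print('UTF-16 BOM present')
    else:
        print('No BOM (the file could still be UTF-8, ANSI, or anything else)')
    ```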

  6. If Notepad++ thought that the file was UTF-8, then saving the file as UTF-8 shouldn't have really changed anything. Something else must have changed.
  7. The file is CSV, comma delimited with double quotes on strings.

    and:

    It's a URL mangled among 1 million other records.

    So, is ée the entire value for that field? Is it within double-quotes in the file? ée does not appear to be a URL ;-)

    Also, does the value shown in the error match, exactly, the value you see in the file? If so, there is likely a hidden character in the file that is causing this problem.

  8. Looks like there are ASCII characters coming in even though it's a UTF-8 document.

    What do you mean by "ASCII" characters? One of the features of UTF-8 is that the encoding of the first 128 code points (U+0000 through U+007F) is identical to standard ASCII. This was the primary design goal of UTF-8: full ASCII compatibility (not 8-bit Extended ASCII, just 7-bit standard ASCII). So, if the majority of the document is using US English characters and punctuation, then yes, you should expect to see that just the same whether you open the file as ASCII or UTF-8.
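
    For example, a plain-ASCII URL produces exactly the same bytes under ASCII, UTF-8, or Windows-1252; the differences only start outside the first 128 code points (Python 3 sketch):

    ```python
    url = 'http://example.com/path?q=1'

    # Identical byte sequences: UTF-8 is a superset of 7-bit ASCII.
    assert url.encode('ascii') == url.encode('utf-8') == url.encode('cp1252')

    # The difference only appears outside U+0000..U+007F.
    print('é'.encode('utf-8'))    # b'\xc3\xa9'  (two bytes)
    print('é'.encode('cp1252'))   # b'\xe9'      (one byte)
    print('é'.encode('ascii'))    # raises UnicodeEncodeError
    ```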

  9. When I convert it to ANSI and use the ASCII extractor the file succeeds.

    This is not surprising. I don't think it's possible to get an ANSI / ASCII encoding error. There are only a few byte values that are undefined in code page Windows-1252 (0x81, 0x8D, 0x8F, 0x90, and 0x9D), and they generally behave as "hidden" characters.

    Of course, if you do have valid UTF-8 encoded characters in that file, converting to ANSI will change them into "?" if they are not available in Windows-1252.
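
    Both effects are easy to reproduce (a Python 3 sketch). Decoding with a single-byte code page such as ISO-8859-1 can never fail, since every byte value is defined, while encoding to Windows-1252 with replacement turns anything it can't hold into "?":

    ```python
    # Decoding as ISO-8859-1 cannot fail: every byte 0x00-0xFF maps to a character.
    print(b'\xc3\xa9\xe9\x9d'.decode('iso-8859-1'))

    # Converting UTF-8 text to ANSI replaces anything Windows-1252 can't hold.
    print('Ω and é'.encode('cp1252', errors='replace'))   # b'? and \xe9'
    ```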

Take-aways

  1. If this were an ANSI / Windows-1252 / ISO-8859-1 encoded file such that é indicated a byte sequence of 0xC3A9, then there would be no UTF-8 encoding error since the byte sequence of 0xC3A9 is valid UTF-8 for é.
  2. Since the error itself cannot be shown (else it wouldn't be an encoding error):
    1. ée isn't the error, but instead a result of the error
    2. We need to see the exact bytes of the row that has the encoding error in order to see what could not be decoded.
  3. Nothing says that a file needs to be imported as UTF-8. If a file (even if not this one) is encoded as Windows-1252, ISO-8859-1, or anything else, and the data is all correct in the file, then rather than trying to force the file to be UTF-8, it is better to tell whatever process / tool is reading the file what the actual file encoding is. There is no reason to not import a Windows-1252 encoded file as Windows-1252.

The bottom line is: there is too much guessing / wild-goose-chasing without seeing the original bytes of the line that has the encoding error. Having that line will help determine both a) what happened, and b) how to proceed.
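
As a starting point, a sketch along these lines (Python 3, placeholder path) would find the first line that fails to decode as UTF-8, dump its raw bytes, and show what the other likely encodings make of them:

```python
import binascii

with open('input.csv', 'rb') as f:                  # raw bytes, no decoding
    for line_no, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            print('line', line_no, 'has invalid UTF-8 at byte offset', e.start)
            print('hex:', binascii.hexlify(raw))
            # See what the same bytes look like under other candidate encodings.
            print('as cp1252    :', raw.decode('cp1252', errors='replace'))
            print('as iso-8859-1:', raw.decode('iso-8859-1'))
            break
```

Once the exact byte sequence is known, it should be clear whether that row needs to be repaired or whether the file simply needs to be read with its actual encoding, as described in take-away 3.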

If the poster of the bounty could add a comment to this answer providing additional details, especially the exact byte sequence that is getting the error, that would help solve this.