Why is a UTF-8 document raising UTF-8 encoding errors in Azure Data Lake Analytics?

azure, encoding, hadoop, import, unicode

I have a document that was compressed with gzip by an unknown source system. It was downloaded and decompressed using the 7-Zip console application. The document is a CSV file that appears to be encoded in UTF-8.

It's then uploaded to Azure Data Lake Store right after decompression. Then there is a U-SQL job set up to simply copy it from one folder to another folder. This process fails and raises a UTF-8 encoding error for a value: ée

Testing

I downloaded the document from the store and removed all records except the one with the value flagged by Azure. In Notepad++, it shows the document as UTF-8. I save the document as UTF-8 again and upload it back to the store. I run the process again and it succeeds with that value as UTF-8.

What am I missing here? Is it possible the original document is not truly UTF-8? Is there something else causing a false positive? I'm a bit baffled.

Possibilities

  • The document is not truly UTF-8 and needs to be recoded
  • Maybe the method that's uploading the file is recoding it (see the hash-comparison sketch after this list)
  • Maybe 7zip is recoding it incorrectly
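
A quick way to test the second possibility is to hash the raw bytes of the local file and of the same file downloaded back from the store; if the upload path recoded it, the hashes will differ. A minimal sketch (the file paths are placeholders):

```python
import hashlib

def md5_of(path):
    """Hash the raw bytes so any re-encoding shows up as a mismatch."""
    h = hashlib.md5()
    with open(path, 'rb') as f:  # binary mode: no decoding involved
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

# Placeholder paths: the local file and the same file downloaded back from the store.
print(md5_of('original_local.csv'))
print(md5_of('downloaded_from_store.csv'))
```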

Environment/Tools

  • Windows Server
  • Python 2.7
  • Azure Data Lake Store
  • Azure Data Lake Analytics
  • 7Zip.exe
  • gz
  • Azure API

U-SQL

Just the base U-SQL job that defines the schema and then selects all fields into a new directory. No transformation happens other than leaving out the headers. The file is CSV, comma-delimited with double quotes around strings. The schema is all strings regardless of data type. The extractors tried are TEXT and CSV, with both explicitly set to UTF-8 encoding, even though both default to UTF-8 according to the Azure documentation.

Other Notes

  1. This same document was uploaded in the past to Azure Blob Storage and imported in the same fashion into Azure Data Warehouse, without errors, via PolyBase.
  2. The value that causes the UTF-8 encoding error is a URL mangled among 1 million other records.
  3. It looks like there are ASCII characters coming in even though it's a UTF-8 document.
  4. When I convert it to ANSI and use the ASCII extractor the file succeeds.
  5. Azure Data Lake Analytics does not allow you to ignore the error since it's an encoding issue. I'd be happy invalidating the record altogether, like you can in Azure Data Warehouse.

Best Answer

Notes

There are several things to go over here:

  1. Just to get this out of the way: gz[ip] and 7zip.exe have nothing to do with any of this. Compression does not alter the encoding or any of the original bytes (otherwise compression / decompression wouldn't be reliable). Theoretically there could be a bug in one of them such that the decompressed output is slightly different, but that would be a widespread problem, and these algorithms, and these two tools in particular, have been around for a long time and are known to be reliable.
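
    A quick sanity check (a Python 3 sketch, even though the environment above lists Python 2.7) is to round-trip some non-ASCII bytes through gzip and confirm that nothing changed:

    ```python
    import gzip

    original = 'résumé, naïve, ée'.encode('utf-8')   # arbitrary UTF-8 bytes
    roundtrip = gzip.decompress(gzip.compress(original))

    # Compression is lossless: the decompressed bytes are identical to the input.
    assert roundtrip == original
    ```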

  2. Keep in mind: files, even text files, contain bytes, not characters. Those bytes can represent one set of characters, or another set, or something else entirely. But in the end, it is just a collection of bytes. So if you see Ã, that is not because there is an Ã in the file; it is because one or more bytes are currently being interpreted as representing an Ã. It could be that those same bytes should truly be interpreted as representing something else. And, at the same time, it could also be that some other sequence of bytes, under a different interpretation, also represents Ã.
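
    For example, the exact same two bytes decode to different characters depending on which encoding the reader is told to use (Python 3 sketch):

    ```python
    data = b'\xc3\xa9'                 # two bytes on disk

    print(data.decode('utf-8'))        # é   (one 2-byte UTF-8 sequence)
    print(data.decode('cp1252'))       # Ã©  (two 1-byte Windows-1252 characters)
    print(data.decode('iso-8859-1'))   # Ã©  (same idea in ISO-8859-1)
    ```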

  3. Then there is a U-SQL job set up to simply copy it from one folder to another folder.

    Ok, this right here should raise a red flag: why would "simply copying a file" produce an encoding error? There can only be an encoding error if the file is being read and decoded; otherwise it's just a collection of bytes moving from point A to point B (or perhaps it's merely being relinked and not moved, but either way).
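
    To illustrate: a byte-for-byte copy never decodes anything, so it cannot hit an encoding error; only a copy that reads the text (and therefore decodes it) can. A Python 3 sketch with placeholder paths:

    ```python
    import shutil

    # Byte-for-byte copy: no decoding happens, so no encoding error is possible.
    shutil.copyfile('input.csv', 'output.csv')

    # A "copy" that decodes the text can fail on bytes that are not valid UTF-8.
    with open('input.csv', encoding='utf-8') as src, \
         open('output2.csv', 'w', encoding='utf-8') as dst:
        dst.write(src.read())          # may raise UnicodeDecodeError here
    ```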

  4. This process fails and raises a UTF-8 encoding error for a value: ée

    Here is another key indicator (that is likely being misinterpreted by some): the error is a UTF-8 encoding error. This tells us two things:

    1. The file is being interpreted as UTF-8 already, so the byte sequences are not being interpreted as some other encoding (e.g. Windows-1252, or ISO-8859-1). This means that the é characters are already UTF-8 encoded bytes in the file (i.e. 0xC3 0x83 0xC2 0xA9), and not Windows-1252 bytes (0xC3 0xA9) that should instead be interpreted as UTF-8 to produce é.
    2. The fact that it's an error means that ée is not the error, because if it were the error, then it wouldn't be able to display it. An encoding error occurs when a byte sequence can't be decoded into a character in that encoding. Meaning: there is something missing from the ée sequence. The ée sequence should be the remaining parts of what could be decoded from the bytes in the file. Keep in mind that Unicode is a specification, not software; the implementation is left up to each vendor. There are guidelines regarding compliant implementations, and those include how to handle decoding errors, but there is some flexibility as well. So some implementations might throw a hard error in this case, others might display the "replacement" character, �, and some might display nothing in that position. In any case, the ée sequence, taken literally, is most likely a red herring and not what we should be looking at.
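
    Here is a sketch (Python 3, with made-up bytes) showing how a decoder reports the part it could decode and flags the part it could not; the trailing 0xE9 byte is not valid UTF-8 on its own:

    ```python
    # 'Ã©' encoded as UTF-8 (0xC3 0x83 0xC2 0xA9), then a stray 0xE9 byte that
    # is NOT a valid UTF-8 sequence by itself, then a plain 'e'.
    data = b'\xc3\x83\xc2\xa9\xe9e'

    print(data.decode('utf-8', errors='replace'))   # Ã©�e  -> � marks the bad byte
    data.decode('utf-8')                            # raises UnicodeDecodeError
    ```

    Depending on the implementation, that bad byte may be reported as a hard error, shown as �, or silently dropped, which would leave exactly ée on screen.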
  5. In Notepad++, it shows the document as UTF-8. I save the document as UTF-8 again and upload it back to the store. I run the process again and the process succeeds with that value as UTF-8

    Ok. So, in Notepad++, when it shows "UTF-8" on the right side of the bottom bar upon opening a file, that is not necessarily a guarantee that the file is, in fact, encoded as UTF-8. That is a best guess, probably based on common byte sequences. If the encoding indicator instead showed "UTF-8-BOM", then that would be a guarantee that the file was encoded as UTF-8. The "-BOM" indicates that a Byte Order Mark is present. A BOM is an optional 2- to 4-byte sequence at the beginning of a file that should be hidden but indicates the encoding of the file; BOMs exist only for the Unicode encodings. There are two 2-byte BOMs for UTF-16, one for Little Endian and one for Big Endian. There are also two 4-byte BOMs for UTF-32, one for each Endianness. UTF-8 has only a single 3-byte BOM since Endianness does not apply to UTF-8 (the code units are a single byte, so there is only one order that the byte can be in).

    Of course, having a BOM does not guarantee that it will be respected or even understood by whatever is reading the file. In the case presented here, having a BOM should not have any effect on the outcome since the file is already being read as a UTF-8 file. And again, if Notepad++ is showing ée and shows "UTF-8" in the lower-right corner, then it's already UTF-8, not Windows-1252 / ISO-8859-1 that needs to be told to be read as UTF-8 in order to produce é.
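
    If you would rather check for a BOM yourself than trust an editor's guess, the first few raw bytes of the file tell you (placeholder path):

    ```python
    import codecs

    with open('input.csv', 'rb') as f:     # read raw bytes, no decoding
        head = f.read(4)

    if head.startswith(codecs.BOM_UTF8):                            # EF BB BF
        print('UTF-8 BOM present')
    elif head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        print('UTF-32 BOM present')
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        print('UTF-16 BOM present')
    else:
        print('No BOM (the file could still be UTF-8, ANSI, or anything else)')
    ```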

  6. If Notepad++ thought that the file was UTF-8, then saving the file as UTF-8 shouldn't have really changed anything. Something else must have changed.
  7. The file is CSV, comma delimited with double quotes on strings.

    and:

    It's a URL mangled among 1 million other records.

    So, is ée the entire value for that field? Is it within double-quotes in the file? ée does not appear to be a URL ;-)

    Also, does the value shown in the error match, exactly, the value you see in the file? If so, there is likely a hidden character in the file that is causing this problem.

  8. Looks like there are ASCII characters coming in even though it's a UTF-8 document.

    What do you mean by "ASCII" characters? One of the features of UTF-8 is that the encoding of the first 128 code points (U+0000 through U+007F) is identical to standard ASCII. This was the primary design goal of UTF-8: full ASCII compatibility (not 8-bit Extended ASCII, just 7-bit standard ASCII). So, if the majority of the document is using US English characters and punctuation, then yes, you should expect to see that just the same whether you open the file as ASCII or UTF-8.
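
    For example, a plain-ASCII URL produces exactly the same bytes under ASCII, UTF-8, or Windows-1252; the differences only start outside the first 128 code points (Python 3 sketch):

    ```python
    url = 'http://example.com/path?q=1'

    # Identical byte sequences: UTF-8 is a superset of 7-bit ASCII.
    assert url.encode('ascii') == url.encode('utf-8') == url.encode('cp1252')

    # The difference only appears outside U+0000..U+007F.
    print('é'.encode('utf-8'))    # b'\xc3\xa9'  (two bytes)
    print('é'.encode('cp1252'))   # b'\xe9'      (one byte)
    print('é'.encode('ascii'))    # raises UnicodeEncodeError
    ```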

  9. When I convert it to ANSI and use the ASCII extractor the file succeeds.

    This is not surprising. I don't think it's possible to get an ANSI / ASCII encoding error. There are only a few byte values that are undefined in code page Windows-1252 (0x81, 0x8D, 0x8F, 0x90, and 0x9D), and they generally behave as "hidden" characters.

    Of course, if you do have valid UTF-8 encoded characters in that file, converting to ANSI will change them into "?" if they are not available in Windows-1252.
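
    Both effects are easy to reproduce (a Python 3 sketch). Decoding with a single-byte code page such as ISO-8859-1 can never fail, since every byte value is defined, while encoding to Windows-1252 with replacement turns anything it can't hold into "?":

    ```python
    # Decoding as ISO-8859-1 cannot fail: every byte 0x00-0xFF maps to a character.
    print(b'\xc3\xa9\xe9\x9d'.decode('iso-8859-1'))

    # Converting UTF-8 text to ANSI replaces anything Windows-1252 can't hold.
    print('Ω and é'.encode('cp1252', errors='replace'))   # b'? and \xe9'
    ```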

Take-aways

  1. If this were an ANSI / Windows-1252 / ISO-8859-1 encoded file such that é indicated a byte sequence of 0xC3A9, then there would be no UTF-8 encoding error since the byte sequence of 0xC3A9 is valid UTF-8 for é.
  2. Since the error itself cannot be shown (else it wouldn't be an encoding error):
    1. ée isn't the error, but instead a result of the error
    2. We need to see the exact bytes of the row that has the encoding error in order to see what could not be decoded.
  3. Nothing says that a file needs to be imported as UTF-8. If a file (even if not this one) is encoded as Windows-1252, ISO-8859-1, or anything else, and the data is all correct in the file, then rather than trying to force the file to be UTF-8, it is better to tell whatever process / tool is reading the file what the actual file encoding is. There is no reason to not import a Windows-1252 encoded file as Windows-1252.

The bottom line is: there is too much guessing / wild-goose-chasing without seeing the original bytes of the line that has the encoding error. Having that line will help determine both a) what happened, and b) how to proceed.
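
As a starting point, a sketch along these lines (Python 3, placeholder path) would find the first line that fails to decode as UTF-8, dump its raw bytes, and show what the other likely encodings make of them:

```python
import binascii

with open('input.csv', 'rb') as f:                  # raw bytes, no decoding
    for line_no, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            print('line', line_no, 'has invalid UTF-8 at byte offset', e.start)
            print('hex:', binascii.hexlify(raw))
            # See what the same bytes look like under other candidate encodings.
            print('as cp1252    :', raw.decode('cp1252', errors='replace'))
            print('as iso-8859-1:', raw.decode('iso-8859-1'))
            break
```

Once the exact byte sequence is known, it should be clear whether that row needs to be repaired or whether the file simply needs to be read with its actual encoding, as described in take-away 3.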

If the poster of the bounty could add a comment to this answer providing additional details, especially the exact byte sequence that is getting the error, that would help solve this.