Sql-server – get incorrect characters when decoding a Base64 string to NVARCHAR in SQL Server

collationencodingsql serversql-server-2012unicode

I've been looking at how to decode Base64 with SQL Server and after searching around many solutions online (some from here) seem to be based upon this sort of method.

SELECT CAST(CAST('Base64StringHere' as XML ).value('.','varbinary(max)') AS VARCHAR(250))

When I have ASCII text this works perfectly. However when I have the following French text it gets corrupted (presumably due to the restrictions of VARCHAR).

Où est le café le plus proche?
T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=

And provides the following output.

Où est le café le plus proche?

I thought the relatively simple fix would be to change the CAST to NVARCHAR but this results in corruption again.

SELECT CAST(CAST('T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=' as XML ).value('.','varbinary(max)') AS NVARCHAR(250) )

썏₹獥⁴敬挠晡꧃氠⁥汰獵瀠潲档㽥

My search engine skills may be failing me, but I cannot seem to find someone else who has my problem.

Any thoughts?

Best Answer

The problem is that you encoded a UTF-8 encoded string into Base64. So, decoding the Base64 gives you back the original UTF-8 sequence of bytes. SQL Server uses UTF-16 Little Endian only for NVARCHAR data, and even for XML. Hence, ù is the 8-bit version of the two-byte UTF-8 sequence for ù (0xC3 and 0xB9).

Fortunately, it is possible to convert a UTF-8 encoded string into UTF-16, or even into a non-Unicode Code Page (IF the Code Page supports all characters being converted). The trick is to convert the Base64 decoded bytes, in their text representation (even with incorrectly converted characters) into XML. The trick to this trick is that you need to add the <?xml ...> declaration (typically omitted) and specify the source encoding:

DECLARE @Base64Value NVARCHAR(500) = N'T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=';

DECLARE @BinaryValue VARBINARY(500) =
    CONVERT(XML, @Base64Value).value('.','varbinary(max)');

DECLARE @IntermediaryValue VARCHAR(500) = CONVERT(VARCHAR(500), @BinaryValue);

SELECT @BinaryValue, @IntermediaryValue;
-- 0x4FC3B920657374206C6520636166C3A9206C6520706C75732070726F6368653F
-- Oֳ¹ est le cafֳ© le plus proche?


-- This is to NVARCHAR, which will always work:
SELECT CONVERT(NVARCHAR(500),
    CONVERT(XML, '<?xml version="1.0" encoding="UTF-8"?>' +  @IntermediaryValue)
              );
-- Où est le café le plus proche?


-- This is to VARCHAR, but "success" will depend on the Code Page
-- specified by the default Collation of the current Database:
SELECT CONVERT(VARCHAR(500),
    CONVERT(XML, '<?xml version="1.0" encoding="UTF-8"?>' +  @IntermediaryValue)
              );

 -- In a DB with a Latin1_General Collation it works:
 -- Où est le café le plus proche?


 -- In a DB with a Hebrew Collation, it gets the following error:
 /*
   Msg 6355, Level 16, State 1, Line XXXXX
   Conversion of one or more characters from XML to target collation impossible
 */

Please note that this trick only works for converting from various source encodings into UTF-16 Little Endian (as that is how the XML datatype in SQL Server stores strings internally). This method cannot be used to convert UTF-16 into UTF-8 or some other non-SQL Server-supported encoding.


Below is an Inline-TVF encapsulating the steps shown above:

GO
CREATE FUNCTION dbo.ConvertBase64EncodedUTF8ToUTF16LE
(
  @Base64EncodedUTF8String VARCHAR(8000)
)
RETURNS TABLE
AS RETURN

    SELECT 
        CONVERT(NVARCHAR(500),
                CONVERT(XML,
                        '<?xml version="1.0" encoding="UTF-8"?>' +
                        CONVERT(VARCHAR(500),
                                CONVERT(XML, @Base64EncodedUTF8String)
                                  .value('.','varbinary(max)')
                               )
                       )
               ) AS [DecodedValue];

GO

And then test:

SELECT *
FROM   dbo.ConvertBase64EncodedUTF8ToUTF16LE(
          'T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=');
-- Où est le café le plus proche?

Just use with CROSS APPLY if doing a set-based operation.