I need to dig into a logs table with a schema similar to this:
CREATE TABLE t (
id int PRIMARY KEY,
data varchar(max)
);
Column data
stores a XML text received from a web service in this format:
This is a reduced version
<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
<TAB DIM="30" ID="ZC3D2_1" SIZE="5">
<LIN NUM = "1">
<FLD NAME = "ZDOC" TYPE = "Char">Ferran López</FLD>
</LIN>
</TAB>
</PARAM>
When I try to cast this text to XML I get next error:
XML parsing: line xx, character 48, illegal xml character
It can be solved by removing the <xml>
tag, or at least, the encoding
attribute.
NOTE: It works fine if there is no special characters like
ó
, even if I don't remove<xml>
tag.
Question
Is there a way to convert it to XML without replacing or removing <xml>
tag?
CAST(REPLACE(data, 'encoding="UTF-8"', '') as XML)
db<>fiddle here
UPDATE
The server collation is: Latin1_General_BIN
But even if I try to change the collation to my usual servers collation, it doesn't work.
SELECT
id,
CAST((data COLLATE Latin1_General_CI_AS) as XML)
FROM
t;
Best Answer
What's happening here is:
XML
type stores data internally as UTF-16 Little Endian (most of the time, at least). It doesn't matter what the source encoding is, the end-result will be UTF-16 LE (and no<xml>
tag, hence noencoding="..."
).XML
:NVARCHAR
data is assumed to be UTF-16 LE. If there is an<xml>
tag and it contains theencoding
attribute, the only valid value is"UTF-16"
.VARCHAR
data is assumed to be in the 8-bit code page associated with the collation of the data when there is no<xml>
tag, or if one exists but there is noencoding
attribute. Else the data will be interpreted as being encoded in the code page specified in theencoding
attribute (even though it is encoded in the code page associated with the collation of the data).Latin1_General_BIN
, it's safe-enough to assume for the moment that the column is using the same collation).ó
character in code page Windows-1252 is: 0xF3.<xml>
tag, however, is declaring that the XML data is encoded as UTF-8.p
, which has a value of 0x70. Hence you get the "illegal xml character" error (because theencoding="UTF-8"
tells the conversion function that the bytes are valid UTF-8 bytes; the conversion doesn't see theó
character).Your options are:
Ideally, the column would be converted to
XML
and theencoding
attribute of the<xml>
tag, or the entire<xml>
tag itself, would be removed on the way in. AND, theXML
datatype can save space if there are repeating element and/or attribute names as it creates a dictionary (lookup list) of names internally and records the structure using the ID values.Set the
[data]
column to use a UTF-8 collation (new in SQL Server 2019, so not an option for you)Set the
[data]
column to beNVARCHAR
, and remove theencoding
attribute of the<xml>
tag, or the entire<xml>
tag.Convert the incoming string into UTF-8 bytes. So the
ó
character is two bytes in UTF-8: 0xC3B3, which appear asó
in Windows-1252.NOTES:
encoding
attribute of the<xml>
tag, or the entire<xml>
tag, is not an option. Sure, it will work in this particular case, but it won't work in all cases due to the column beingVARCHAR
and UTF-8 collations not being available in SQL Server 2014. Hence, any Unicode characters not available in Windows code page 1252 will be converted to?
or??
(depending on BMP character or Supplementary Character):ó
character is not available on the Cyrillic code page. But, there is a "Best Fit" mapping which is why we end up with ano
instead of a?
._100_
level collations. Additionally, anyone working on SQL Server 2012 or newer should ideally be using the_100_
level collation that ends with_SC
(for Supplementary Characters). Finally, when needing a binary collation on SQL Server 2005 or newer, use one ending in_BIN2
(see my post here as to why).