Sql-server – Convert Byte Array From XML to VARBINARY

sql servervarbinaryxml

I receive image files as XML data, each byte of the image being a node with its decimal value, e.g. for this example .png file, 1 , the xml I get is:

DECLARE @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<XmlData>
    <Element>
        <id>Test</id>
        <image>
            <Element>137</Element><Element>80</Element><Element>78</Element><Element>71</Element><Element>13</Element><Element>10</Element><Element>26</Element><Element>10</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>13</Element><Element>73</Element><Element>72</Element><Element>68</Element><Element>82</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>20</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>20</Element><Element>8</Element><Element>6</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>141</Element><Element>137</Element><Element>29</Element><Element>13</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>4</Element><Element>103</Element><Element>65</Element><Element>77</Element><Element>65</Element><Element>0</Element><Element>0</Element><Element>177</Element><Element>143</Element><Element>11</Element><Element>252</Element><Element>97</Element><Element>5</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>9</Element><Element>112</Element><Element>72</Element><Element>89</Element><Element>115</Element><Element>0</Element><Element>0</Element><Element>14</Element><Element>193</Element><Element>0</Element><Element>0</Element><Element>14</Element><Element>193</Element><Element>1</Element><Element>184</Element><Element>145</Element><Element>107</Element><Element>237</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>24</Element><Element>116</Element><Element>69</Element><Element>88</Element><Element>116</Element><Element>83</Element><Element>111</Element><Element>102</Element><Element>116</Element><Element>119</Element><Element>97</Element><Element>114</Element><Element>101</Element><Element>0</Element><Element>112</Element><Element>97</Element><Element>105</Element><Element>110</Element><Element>116</Element><Element>46</Element><Element>110</Element><Element>101</Element><Element>116</Element><Element>32</Element><Element>52</Element><Element>46</Element><Element>48</Element><Element>46</Element><Element>54</Element><Element>252</Element><Element>140</Element><Element>99</Element><Element>223</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>108</Element><Element>73</Element><Element>68</Element><Element>65</Element><Element>84</Element><Element>56</Element><Element>79</Element><Element>99</Element><Element>24</Element><Element>5</Element><Element>184</Element><Element>192</Element><Element>114</Element><Element>32</Element><Element>254</Element><Element>14</Element><Element>196</Element><Element>30</Element><Element>96</Element><Element>30</Element><Element>21</Element><Element>192</Element><Element>126</Element><Element>32</Element><Element>254</Element><Element>15</Element><Element>196</Element><Element>9</Element><Element>96</Element><Element>30</Element><Element>21</Element><Element>192</Element><Element>48</Element><Element>55</Element><Element>80</Element><Element>7</Element><Element>136</Element><Element>29</Element><Element>208</Element><Element>240</Element><Element>121</Element><Element>32</Element><Element>6</Element><Element>25</Element><Element>216</Element><Element>142</Element><Element>36</Element><Element>6</Element><Element>195</Element><Element>34</Element><Element>64</Element><Element>140</Element><Element>23</Element><Element>128</Element><Element>98</Element><Element>19</Element><Element>164</Element><Element>153</Element><Element>88</Element><Element>60</Element><Element>27</Element><Element>136</Element><Element>241</Element><Element>130</Element><Element>213</Element><Element>64</Element><Element>12</Element><Element>242</Element><Element>34</Element><Element>50</Element><Element>126</Element><Element>15</Element><Element>196</Element><Element>32</Element><Element>205</Element><Element>215</Element><Element>145</Element><Element>196</Element><Element>96</Element><Element>56</Element><Element>3</Element><Element>136</Element><Element>73</Element><Element>6</Element><Element>32</Element><Element>141</Element><Element>32</Element><Element>3</Element><Element>71</Element><Element>147</Element><Element>13</Element><Element>249</Element><Element>128</Element><Element>234</Element><Element>6</Element><Element>250</Element><Element>0</Element><Element>113</Element><Element>5</Element><Element>16</Element><Element>43</Element><Element>128</Element><Element>121</Element><Element>163</Element><Element>0</Element><Element>1</Element><Element>24</Element><Element>24</Element><Element>0</Element><Element>127</Element><Element>60</Element><Element>48</Element><Element>197</Element><Element>152</Element><Element>102</Element><Element>243</Element><Element>130</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>0</Element><Element>73</Element><Element>69</Element><Element>78</Element><Element>68</Element><Element>174</Element><Element>66</Element><Element>96</Element><Element>130</Element>
        </image>
    </Element>
</XmlData>'

In binary:

SELECT * FROM OPENROWSET(BULK 'C:\test.png', SINGLE_BLOB) AS q;
==========
BulkColumn
----------
0x89504E470D0A1A0A0000000D49484452000000140000001408060000008D891D0D0000000467414D410000B18F0BFC6105000000097048597300000EC100000EC101B8916BED0000001874455874536F667477617265007061696E742E6E657420342E302E36FC8C63DF0000006C49444154384F631805B8C07220FE0EC41E601E15C07E20FE0FC409601E15C030375007881DD0F079200619D88E2406C322408C17806213A499583C1B88F182D5400CF222327E0FC420CDD791C4603803884906208D200347930DF980EA06FA007105102B8079A300011818007F3C30C59866F3820000000049454E44AE426082

How to I retrieve the image file from the xml as varbinary?

I've asked a similar question not long ago, so I tried the following query, but the resulting binary data is incorrect:

SELECT r.c.value('id[1]', 'varchar(50)') AS id,
CONVERT(VARBINARY(MAX), (SELECT (t.u.value('.','tinyint')) FROM r.c.nodes('image/Element') AS t(u) FOR XML PATH(''))) AS image
FROM @xml.nodes('/XmlData/Element') AS r(c);
=============
id      image
-------------
Test    0x31003300370038003000370038003700310031003300310030003200360031003000300030003000310033003700330037003200360038003800320030003000300032003000300030003000320030003800360030003000300031003400310031003300370032003900310033003000300030003400310030003300360035003700370036003500300030003100370037003100340033003100310032003500320039003700350030003000300039003100310032003700320038003900310031003500300030003100340031003900330030003000310034003100390033003100310038003400310034003500310030003700320033003700300030003000320034003100310036003600390038003800310031003600380033003100310031003100300032003100310036003100310039003900370031003100340031003000310030003100310032003900370031003000350031003100300031003100360034003600310031003000310030003100310031003600330032003500320034003600340038003400360035003400320035003200310034003000390039003200320033003000300030003100300038003700330036003800360035003800340035003600370039003900390032003400350031003800340031003900320031003100340033003200320035003400310034003100390036003300300039003600330030003200310031003900320031003200360033003200320035003400310035003100390036003900390036003300300032003100310039003200340038003500350038003000370031003300360032003900320030003800320034003000310032003100330032003600320035003200310036003100340032003300360036003100390035003300340036003400310034003000320033003100320038003900380031003900310036003400310035003300380038003600300032003700310033003600320034003100310033003000320031003300360034003100320032003400320033003400350030003100320036003100350031003900360033003200320030003500320031003500310034003500310039003600390036003500360033003100330036003700330036003300320031003400310033003200330037003100310034003700310033003200340039003100320038003200330034003600320035003000300031003100330035003100360034003300310032003800310032003100310036003300300031003200340032003400300031003200370036003000340038003100390037003100350032003100300032003200340033003100330030003000300030003000370033003600390037003800360038003100370034003600360039003600310033003000

Best Answer

This is close but missing a few pieces. You extract into rows of TINYINT the decimal value from each <Element> in the XML (e.g. 137, 80, 78, etc), but then the FOR XML PATH('') converts them back into strings and concatenates them, leaving you with a UTF-16 encoded string of "1378078...". Converting that into VARBINARY just turns each string digit -- "1", "3", "7", "8", etc -- into its binary / hex Code Point:

SELECT CONVERT(VARBINARY(20), N'1378078');
-- 0x3100330037003800300037003800
-- 0x 3100 3300 3700 3800 3000 3700 3800 -- each character separated for readability

-- XML in SQL Server is encoded as UTF-16, same as NCHAR / NVARCHAR.
-- Each of these characters in UTF-16 is two bytes: 0x31 + 0x00, 0x33 + 0x00, etc.
-- Each pair of bytes is in reverse order due to "endianness". 0x3100 is really 0x0031.

SELECT NCHAR(0x0031), NCHAR(0x0033), NCHAR(0x0037), NCHAR(0x0038), NCHAR(0x0030),
       NCHAR(0x0037), NCHAR(0x0038);
-- 1    3   7   8   0   7   8

Instead, you need to do the following:

  1. Convert the decimal / TINYINT "137" into hex / BINARY "0x89"
  2. Convert the hex/binary "0x89" into a string / VARCHAR, but without the leading "0x" (this requires using the CONVERT function, not CAST, so that you can specify the "style" of 2)
  3. Once the FOR XML PATH('') puts everything together in a string in the form of 89504E470D0A1A0A0000..., then you need to apply the "style" of 2 again in the CONVERT(VARBINARY(MAX), ... so that it knows that 89504E470D0A1A0A0000... is merely 0x89504E470D0A1A0A0000... without the leading "0x".

Putting those pieces into your query, we get the following:

SELECT r.c.value('id[1]', 'varchar(50)') AS [id],
       CONVERT(VARBINARY(MAX),
               (SELECT CONVERT(VARCHAR(3),
                               CONVERT(BINARY(1),
                                       t.u.value('.', 'tinyint')
                                      ),
                               2 -- style creates binary string without the leading "0x"
                              )
                FROM r.c.nodes('image/Element') AS t(u)
                FOR XML PATH('')
               ),
               2 -- style creates binary string without the leading "0x"
              ) AS [image]
FROM @xml.nodes('/XmlData/Element') AS r(c);

And that returns the following for the image field:

0x89504E470D0A1A0A0000000D49484452000000140000001408060000008D891D0D0000000467414D410000B18F0BFC6105000000097048597300000EC100000EC101B8916BED0000001874455874536F667477617265007061696E742E6E657420342E302E36FC8C63DF0000006C49444154384F631805B8C07220FE0EC41E601E15C07E20FE0FC409601E15C030375007881DD0F079200619D88E2406C322408C17806213A499583C1B88F182D5400CF222327E0FC420CDD791C4603803884906208D200347930DF980EA06FA007105102B8079A300011818007F3C30C59866F3820000000049454E44AE426082


Something to consider:

The method of sending binary data via

<Element>137</Element><Element>80</Element>...

is probably the worst / least-efficient method possible. I realize that you said that you are receiving this info in this format and so probably are not to blame for this and have no control over it. However, just so everyone understands what is going on (please see UPDATE section below), each byte of the binary data is:

  1. first turned into its decimal equivalent ( values 0 - 255 )
  2. then converted into a string ( taking up 1 - 3 characters )
  3. a string which is stored as UTF-16 which is 2 bytes per characters for these characters ( 2 - 6 bytes )
  4. and wrapped in <Element> and </Element> tags { why not <Byte> ? } ( 19 characters )
  5. which, again, is stored in UTF-16 which is 2 bytes per character for these characters ( 38 bytes )
  6. ultimately taking up 40 - 46 bytes ( 38 + 2 on the low-end, 38 + 6 on the high-end ) per each byte of the original binary value.

How does this work out? Well, only 10 decimal values ( 0 - 9 ) are 1 character / 2 bytes. Another 90 ( 10 - 99 ) are 2 characters / 4 bytes, while the remaining 156 values ( 100 - 255 ) are 3 characters / 6 bytes. So the majority of possible values take up the full 6 bytes, and only a small minority take up the minimum 2 bytes. This means that the average space taken up per each original byte is probably between 2 and 3 characters / 4 - 6 bytes (I guess they call that "5" in some places ;-) ? ).

For your particular example data, you can run the following query to see the breakdown:

;WITH cte AS
(  SELECT LEN(CONVERT(VARCHAR(3),  r.c.value('.', 'tinyint'))) AS [Length]
   FROM @xml.nodes('/XmlData/Element/image/Element') AS r(c)
)
SELECT cte.[Length] AS [ElementLength], COUNT(*) AS [ElementCount]
FROM   cte
GROUP BY cte.[Length];

That returns:

ElementLength    ElementCount
-------------    ------------
1                55
2                98
3                85

Now we can multiply each ElementCount by (ElementLength * 2) to get the number of bytes. And we need to factor in the <Element> tags, which is again 38 bytes per each of the 238 original bytes:

SELECT (55 * 2) + (98 * 4) + (85 * 6) + (238 * 38)
-- 10,056 bytes !!!

And to put that into more concrete terms, we should compare that to the original binary size to get a sense of the bloat:

SELECT 10056 / 238.0 -- 42.25 times larger !!!

Meaning, if you are sent a 1 MB image (not unreasonable), that will be represented by 42.25 MB of XML. Yikes! (please see UPDATE section below)

But before anyone starts complaining about XML, this is not the fault of XML, which while being an undeniably bloated format, can do much better than this. XML (at least natively in SQL Server) supports the ability to handle binary data using Base64 encoding/decoding. For example, using the same test PNG binary value, we can convert that to a string in XML using the FOR XML PATH clause:

DECLARE @PngImage VARBINARY(MAX);
SET @PngImage = 0x89504E470D0A1A0A0000000D49484452000000140000001408060000008D891D0D0\
000000467414D410000B18F0BFC6105000000097048597300000EC100000EC101B8916BED000000187445\
5874536F667477617265007061696E742E6E657420342E302E36FC8C63DF0000006C49444154384F63180\
5B8C07220FE0EC41E601E15C07E20FE0FC409601E15C030375007881DD0F079200619D88E2406C322408C\
17806213A499583C1B88F182D5400CF222327E0FC420CDD791C4603803884906208D200347930DF980EA0\
6FA007105102B8079A300011818007F3C30C59866F3820000000049454E44AE426082;

SELECT @PngImage AS [PngImage]
FOR XML PATH('Test'), BINARY BASE64;

which gives us:

<Test>
  <PngImage>iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAABGdBTUEAALGPC/xhBQAAAAlwSFlzAAAOwQAADsEBuJFr7QAAABh0RVh0U29mdHdhcmUAcGFpbnQubmV0IDQuMC42/Ixj3wAAAGxJREFUOE9jGAW4wHIg/g7EHmAeFcB+IP4PxAlgHhXAMDdQB4gd0PB5IAYZ2I4kBsMiQIwXgGITpJlYPBuI8YLVQAzyIjJ+D8QgzdeRxGA4A4hJBiCNIANHkw35gOoG+gBxBRArgHmjAAEYGAB/PDDFmGbzggAAAABJRU5ErkJggg==</PngImage>
</Test>

And doing a DATALENGTH(N'iVBORw0KGg...') gives us 640 bytes, and including the <PngImage> tag ( DATALENGTH(N'<PngImage>iVBORw0KGg...</PngImage>') ) gives us a grand total of 682 bytes, not characters, but total bytes (only 2.87 times larger than the original) as compared to the 10,056 total bytes of the current method.

And how much effort does it take to get the VARBINARY(MAX) value out of that Base64 encoded string? Absolutely NONE:

DECLARE @PngImage XML = N'<Test>
  <PngImage>iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAABGdBTUEAALGPC/xhBQAAAAlwSFlzAAAOwQAADsEBuJFr7QAAABh0RVh0U29mdHdhcmUAcGFpbnQubmV0IDQuMC42/Ixj3wAAAGxJREFUOE9jGAW4wHIg/g7EHmAeFcB+IP4PxAlgHhXAMDdQB4gd0PB5IAYZ2I4kBsMiQIwXgGITpJlYPBuI8YLVQAzyIjJ+D8QgzdeRxGA4A4hJBiCNIANHkw35gOoG+gBxBRArgHmjAAEYGAB/PDDFmGbzggAAAABJRU5ErkJggg==</PngImage>
</Test>';
SELECT @PngImage.value('(/Test/PngImage)[1]', 'VARBINARY(MAX)');

Returns:

0x89504E470D0...

So, you might want to mention this to whomever is sending you this data. And Base64 encoding is fairly standardized and easy to encode/decode in most languages.


UPDATE

In researching something else I found that the XML datatype in SQL Server is optimized fairly well. It actually does not store the full text of the XML document. At the very least, it creates a dictionary (i.e. array) of unique element and attribute names, assigns each one a number, and uses that number to reference them in the document. This saves an incredible amount of space. So in this particular XML document, not only are the 19 characters of each <Element></Element> element within the <image> node not repeated (which is 38 bytes in UTF-16 encoding), but in the main dictionary the name Element only shows up once even though it is used at two different levels of the structure.

The optimized structure can only be seen when saving the document to an XML field of a table and then viewing the data page via DBCC PAGE. Viewing that shows some amount of bytes used for document overhead, the dictionary of element names (containing "Element" only once), and the 238 elements (representing the bytes of the PNG), each element having 5 bytes of overhead.

I re-checked the size of each variation, this time checking the size directly for the original @xml variable (as found in the Question) and then converting that to both NVARCHAR and to VARCHAR, and then doing similar checks to the Base64 encoded XML via the following change the 2nd to last test above:

DECLARE @ConvertedXml XML;
SET @ConvertedXml = (
  SELECT @PngImage AS [PngImage]
  FOR XML PATH('Test'), BINARY BASE64
);

SELECT DATALENGTH(@PngImage) AS [PNG],
       DATALENGTH(@ConvertedXml) AS [XmlBytes],
       DATALENGTH(CONVERT(NVARCHAR(4000), @ConvertedXml)) AS [NVarCharBytes],
       DATALENGTH(CONVERT(VARCHAR(4000), @ConvertedXml)) AS [VarCharBytes];

And the results:

Format                           Size (in bytes)

PNG (original file)                         238

XML in VARCHAR (or ASCII text file)        5094
XML in NVARCHAR (or UTF-16 text file)     10188
XML datatype                               2295

Base64-XML datatype                         690
Base64-NVARCHAR                             708
Base64-VARCHAR                              354

As you can see, the size varies based on how it is stored. The XML in the Question originally came from a file. So that file, probably regular ASCII / ANSI text file, was 5094 bytes. And if it was stored in a VARCHAR field it would be that same size. If the file was saved with a UTF-16 encoding, then it was actually 10,188 bytes and would be the same size if stored in an NVARCHAR field. But that same XML document, stored in SQL Server in a XML field or variable, is only 2295 bytes! That is kinda cool :-). Although, 2295 bytes is still about 10 times larger than the original PNG file.

But the Base64 encoding still holds up as the best way to store and transmit the binary value in XML. The exact same PNG file is only 690 bytes when stored in a XML field or variable. It is also only 708 bytes if that XML were stored in an NVARCHAR field or UTF-16 text file, and only 354 bytes if stored in a VARCHAR field or ASCII text file.