SQL Server XML – How to Read HTML Code as XML and Get the Output

sql servert-sqlxml

I have HTML code stored in the data base, and I want to read it as XML.

My codes:

http://rextester.com/RMEHO89992

This is an example of the HTML code I have:

<div>
  <section>
       <h4>
         <span> A </span>
        </h4>
        <ul>
           <li>
              <span> Ab</span>
                     AD
              <span> AC </span>
           </li>
           <li>
              <span> Ag</span>
              <span> AL </span>
           </li>
        </ul>
        <h4>
              <span> B </span>
       </h4>
       <ul>
           <li>
              <span> Bb</span>
                     BD
              <span> BC </span>
           </li>
           <li>
              <span> Bg</span>
              <span> BL </span>
           </li>
        </ul>
   </section>
</div>

and this is an example of the output I need:

Category         Selection        Value                    
---------        ---------        ------------             
A                Ab               AD                  
A                Ag               AL                       
B                Bb               BD                   
B                Bg               BL                       

I need to get the value inside the <h4> tag as a Category, the first <span> tag as Selection, and the rest of the values as a concatenated string.

I've tried the following query:

SELECT 
    (  isnull(t.v.value('(h4/span/span[1]/text())[1]','nvarchar(max)'),'') 
     + isnull(t.v.value('(h4/span/text())[1]','nvarchar(max)'),'')
     + isnull(t.v.value('(h4/span/span[2]/text())[2]','nvarchar(max)'),'')
    ) AS [Category],
    (  isnull(c.g.value('(span[1]/text())[1]','nvarchar(max)'),'')
     + isnull(c.g.value('(span[1]/span/text())[1]','nvarchar(max)'),'')
     + isnull(c.g.value('(span[1]/text())[2]','nvarchar(max)'),'')
    ) AS [Selection],
    (  isnull(c.g.value('(span[2]/text())[1]','nvarchar(max)'),'')
     + isnull(c.g.value('(span[2]/span/text())[1]','nvarchar(max)'),'')
     + isnull(c.g.value('(span[2]/text())[2]','nvarchar(max)'),'')
    ) AS [Value]
FROM @htmlXML.nodes('div/section') as t(v)
CROSS APPLY t.v.nodes('./ul/li') AS c(g) 

and :

SELECT 
       t.v.value('.','nvarchar(max)')
       ,
     --( isnull(t.v.value('(h4/span/span[1]/text())[1]','nvarchar(max)'),'')+isnull(t.v.value('(h4/span/text())[1]','nvarchar(max)'),'')+isnull(t.v.value('(h4/span/span[2]/text())[2]','nvarchar(max)'),''))AS [Category],

          ( isnull(c.g.value('(span[1]/text())[1]','nvarchar(max)'),'')+isnull(c.g.value('(span[1]/span/text())[1]','nvarchar(max)'),'')+isnull(c.g.value('(span[1]/text())[2]','nvarchar(max)'),''))AS [Selection]

           ,
         ( isnull(c.g.value('(span[2]/text())[1]','nvarchar(max)'),'')+isnull(c.g.value('(span[2]/span/text())[1]','nvarchar(max)'),'')+isnull(c.g.value('(span[2]/text())[2]','nvarchar(max)'),''))AS [Value]
    FROM    @htmlXML.nodes('div/section/h4/span') as t(v)
    CROSS APPLY @htmlXML.nodes('div/section/ul/li') AS c(g)

But it only gets the first category, and doesn't get all the values togheter.

Category         Selection        Value
---------        ---------        ------------
A                Ab               AC 
B                Ab               AC 
A                Ag               AL
B                Ag               AL 
A                Bb               BC
B                Bb               BC 
A                Bg               BL 
B                Bg               BL 

There can be N categories, and the values might or might not be inside <span> tags.
How can I get all the categories with their corresponding value?
or get :

category              h4 number
--------            -----------
 A                     1
 B                     2
  • 1 ,mean = h4 first , 2 ,mean = h4 second
 ul number         Selection        Value                    
    ---------        ---------        ------------             
    1                Ab               AD                  
    1                Ag               AL                       
    2                Bb               BD                   
    2                Bg               BL       

relation between column ul number and h4 number. i cannt.

Best Answer

This is not exactly elegant but seems to do the job.

DECLARE @X XML = REPLACE(REPLACE(@S, '<h4>', '<foo><h4>'), '</ul>', '</ul></foo>')

SELECT Category = x.value('../../h4[1]/span[1]', 'varchar(10)'),
       Selection = x.value('descendant-or-self::text()[1]', 'varchar(10)'),
       Value = REPLACE(
                REPLACE(
                 REPLACE(
                  LTRIM(
                   RTRIM(
                    REPLACE(
                     REPLACE(
                      CAST(x.x.query('fn:data(descendant-or-self::text()[fn:position() > 1])') AS VARCHAR(MAX))
                     , char(10), '')
                    , char(13), '')
                   )
                  )
                 , '  ', ' |')
                , '| ', '')
               , '|', '')
FROM   @X.nodes('div/section/foo/ul/li') x(x)
ORDER  BY Category,
          Selection

Which returns

+----------+-----------+-------+
| Category | Selection | Value |
+----------+-----------+-------+
|  A       |  Ab       | AD AC |
|  A       |  Ag       | AL    |
|  B       |  Bb       | BD BC |
|  B       |  Bg       | BL    |
+----------+-----------+-------+

I'm assuming this is what you want as the desired results table in the question does not return the "rest of the values as a concatenated string"