Extract the Children of a Specific XML Element Type

awkctagssedtext processingxml

Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:

<!-- data.xml -->

<instance ab=1 >
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2 >
    <b1>bb</b1>
    <b2>bb</b2>
</instance>
<instance ab=3 >
    <c1>cc</c1>
    <c2>cc</c2>
</instance>

I would like a script or command which takes this data as input and produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

I would like for the solution to use standard text-processing tools such as sed or awk.

I tried using the following sed command, but it did not work:

sed -n '/<Sample/,/<\/Sample/p' data.xml

Best Answer

If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:

You should also be aware that there are several XML-specific programming/query languages:

Note that (in order to be valid XML) your XML data needs a root node and that your attribute values should be quoted, i.e. your data file should look more like this:

<!-- data.xml -->

<instances>

    <instance ab='1'>
        <a1>aa</a1>
        <a2>aa</a2>
    </instance>

    <instance ab='2'>
        <b1>bb</b1>
        <b2>bb</b2>
    </instance>

    <instance ab='3'>
        <c1>cc</c1>
        <c2>cc</c2>
    </instance>

</instances>

If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:

xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml

This produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    print(''.join([xml.etree.ElementTree.tostring(child).strip() for child in instance]))

And here is how you could run the script:

python extract_instance_children.py data.xml

This uses the xml package from the Python Standard Library which is also a strict XML parser.

If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):

#!/usr/bin/env awk

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}

To execute the script from a file, you would use a command like this one:

awk -f extract_instance_children.awk data.xml

And here is a Bash script that produces the desired output:

#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
while read line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "${line}" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "${line}" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    elif [[ ${instance} == 1 ]]; then
        printf "${line}"
    fi

done < "${1}"

You would execute it like this:

bash extract_instance_children.bash data.xml

Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")
    for instance in soup.findAll('instance'):
        print(''.join([str(child) for child in instance.findChildren()]))
Related Question