Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:
<!-- data.xml -->
<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>
I would like a script or command which takes this data as input and produces the following output:
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
I would like for the solution to use standard text-processing tools such as sed or awk.
I tried using the following sed command, but it did not work:
sed -n '/<Sample/,/<\/Sample/p' data.xml
Best Answer
If you really want sed- or awk-like command-line processing for XML files, then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:
You should also be aware that there are several XML-specific programming/query languages:
Note that, in order to be well-formed XML, your data needs a single root element and your attribute values need to be quoted, i.e. your data file should look more like this:
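Both requirements can be checked with a strict parser. The following is a minimal sketch, assuming a wrapper element named `root` (a hypothetical name; any single enclosing element would do) and using a helper `is_well_formed` that is introduced here only for illustration:

```python
import xml.etree.ElementTree as ET

# The data as posted in the question: unquoted attributes, no root element.
RAW = """<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
"""

# Attribute values quoted, but still no single root element.
QUOTED_ONLY = RAW.replace("ab=1 ", 'ab="1"').replace("ab=2 ", 'ab="2"')

# Quoted attributes AND one enclosing element ("root" is a hypothetical name).
WELL_FORMED = "<root>\n" + QUOTED_ONLY + "</root>\n"


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for name, text in [("raw", RAW), ("quoted only", QUOTED_ONLY),
                       ("well-formed", WELL_FORMED)]:
        print(name, is_well_formed(text))
```

Only the last variant should be accepted, which is why the strict tools in this answer need the data fixed first.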
If your data is well-formed XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:
This produces the following output:
Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:
And here is how you could run the script:
This uses the xml package from the Python Standard Library, which is also a strict XML parser, so it will reject the data unless it has a root element and quoted attribute values.
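A minimal sketch of the standard-library approach might look like the following. It assumes the data has already been wrapped in a root element (called `root` here, a hypothetical name) with quoted attribute values; the function name `extract_children` is also introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the data: one root element, quoted attribute values.
# The element name "root" is an assumption made for this sketch.
DATA = """<root>
<instance ab="1">
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab="2">
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab="3">
<c1>cc</c1>
<c2>cc</c2>
</instance>
</root>"""


def extract_children(xml_text, tag):
    """Return one string per `tag` element, holding its serialized children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter(tag):
        # tostring() also emits the whitespace that follows each child
        # (its "tail"), so strip it before joining the pieces.
        parts = [ET.tostring(child, encoding="unicode").strip()
                 for child in element]
        lines.append("".join(parts))
    return lines


if __name__ == "__main__":
    for line in extract_children(DATA, "instance"):
        print(line)
```

Each printed line corresponds to one `instance` element, with its children concatenated in document order.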
If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):
To execute the script from a file, you would use a command like this one:
And here is a Bash script that produces the desired output:
You would execute it like this:
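The line-oriented idea behind this kind of text processing can also be sketched in Python. This is not the awk or Bash script itself, only an illustration of the same technique, and it assumes that every opening and closing tag of the target element sits on its own line, as in the sample data. The function name `extract_children_linewise` is hypothetical.

```python
import re

# The data exactly as posted in the question (not well-formed XML).
RAW = """<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>
"""


def extract_children_linewise(text, tag):
    """Collect the lines found between each <tag ...> and </tag> line."""
    open_re = re.compile(rf"<{re.escape(tag)}[\s>]")
    close_tag = f"</{tag}>"
    results = []
    current = None  # None means "not inside the target element"
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            if open_re.match(stripped):
                current = []  # opening line seen: start a new group
        elif stripped.startswith(close_tag):
            results.append("".join(current))  # closing line: emit the group
            current = None
        else:
            current.append(stripped)
    return results


if __name__ == "__main__":
    print("\n".join(extract_children_linewise(RAW, "instance")))
```

Because no XML parsing takes place, this works on the unquoted, rootless input, but it will break as soon as the layout changes, for example if an element is opened and closed on the same line.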
Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:
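A minimal sketch of the Beautiful Soup approach follows. It assumes the third-party `bs4` package is installed and uses Python's built-in `html.parser` backend, which is a choice made for this sketch and may differ from the parser the original script used. The function name `extract_children_soup` is hypothetical.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The data exactly as posted in the question (not well-formed XML).
RAW = """<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>
"""


def extract_children_soup(text, tag):
    """Return one string per `tag` element, holding its direct child tags."""
    # "html.parser" is lenient: it tolerates the unquoted attribute values
    # and the missing root element. (Assumed backend for this sketch.)
    soup = BeautifulSoup(text, "html.parser")
    lines = []
    for element in soup.find_all(tag):
        # find_all(True, recursive=False) yields only direct child tags,
        # skipping the whitespace text nodes between them.
        children = element.find_all(True, recursive=False)
        lines.append("".join(str(child) for child in children))
    return lines


if __name__ == "__main__":
    for line in extract_children_soup(RAW, "instance"):
        print(line)
```

Note that an HTML-oriented backend lowercases tag names, which is harmless for this data but matters for case-sensitive XML.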