Suppose you have a file containing many xml documents, like
<a>
<b>
...
</a>
in between xml documents there may be plain text log messages
<x>
...
</x>
...
How would I filter this file to show only those xml documents where a given regexp matches any one of the lines of that xml document? I'm talking about a simple textual match here, so the regexp matching part may as well be totally ignorant of the underlying format – xml.
You can assume that the opening and closing tags of the root element are always on lines of their own (though may be white-space padded), and that they are only used as root elements, i.e. tags with the same name do not appear below the root element. This should make it possible to get the job done without having to resort to xml aware tools.
Best Answer
Summary
I wrote a Python solution, a Bash solution, and an Awk solution. The idea for all the scripts is the same: go through line-by-line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).
In the Python script I read all of the lines into a list and keep track of the list-index where the current XML subdocument begins so that I can print out the current subdocument when we reach the closing tag. I check each line for the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.
In the Bash script I use a temporary file as a buffer to store the current XML subdocument and wait until it's done being written before using
grep
to check if it contains a line matching the given regex.The Awk script is similar to the Base script, but I use Awk array for the buffer instead of a file.
Test Data File
I checked both scripts against the following data file (
data.xml
) based on the example data given in your question:Python Solution
Here's a simple Python script that does what you want:
Here's how you run the script to search for the string "stuff":
And here's the output:
And here's how you run the script to search for the string "øæå":
And here's the output:
You can also specify
-v
or--invert-match
to search for non-matching documents, and work on stdin:Bash Solution
Here is bash implementation of the same basic algorithm. It uses flags to keep track of whether or the current line belongs to an XML document and uses a temporary file as a buffer to store each XML document as it's being processed.
Here's how you could run the script to search for the string 'stuff':
And here's the corresponding output:
Here's how you might run the script to search for the string 'øæå':
And here's the corresponding output:
Awk Solution
Here is an
awk
solution - myawk
isn't great though, so it's pretty rough. It use the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (anawk
array) and uses flags to keep track of state. When it finishes processing a document it prints it if it contains any lines matching the given regular expression. Here is the script:Here is how you would call it:
And here is the corresponding output: