Filter xml documents matching certain ids

awk grep perl sed xml

Suppose you have a file containing many xml documents, like

<a>
  <b>
  ...
</a>
in between xml documents there may be plain text log messages
<x>
  ...
</x>

...

How would I filter this file to show only those xml documents where a given regexp matches any one of the lines of that xml document? I'm talking about a simple textual match here, so the regexp matching part may as well be totally ignorant of the underlying format – xml.

You can assume that the opening and closing tags of the root element are always on lines of their own (though may be white-space padded), and that they are only used as root elements, i.e. tags with the same name do not appear below the root element. This should make it possible to get the job done without having to resort to xml aware tools.

Best Answer

Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea for all the scripts is the same: go through line-by-line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).

In the Python script I read all of the lines into a list and keep track of the list-index where the current XML subdocument begins so that I can print out the current subdocument when we reach the closing tag. I check each line for the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.

In the Bash script I use a temporary file as a buffer to store the current XML subdocument and wait until it's done being written before using grep to check if it contains a line matching the given regex.

The Awk script is similar to the Bash script, but I use an Awk array for the buffer instead of a file.
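The buffering idea used by the Bash and Awk scripts can also be sketched in Python without reading the whole file into memory. This is only a minimal sketch, written for Python 3 and relying on the assumptions from the question (root tags on lines of their own, no nested tags with the same name). The function name `filter_documents` and its `invert` parameter are made up for illustration and are not part of the scripts in this answer.

```python
import re

# An opening root tag on a line of its own (after stripping white-space)
OPEN_TAG = re.compile(r'^<(\w+)>$')


def filter_documents(lines, pattern, invert=False):
    """Yield each XML subdocument (a list of lines) that has a matching line.

    Hypothetical helper: `lines` is any iterable of strings and
    `pattern` is a regular expression given as a string.
    """
    regex = re.compile(pattern)
    buffer = []          # lines of the subdocument currently being read
    closing_tag = None   # None means we are between subdocuments
    matched = False

    for line in lines:
        stripped = line.strip()

        if closing_tag is None:
            # Between subdocuments: only look for a new opening tag
            opening = OPEN_TAG.match(stripped)
            if opening:
                closing_tag = '</' + opening.group(1) + '>'
                buffer = [line]
                matched = False
            continue

        # Inside a subdocument: buffer the line, then classify it
        buffer.append(line)
        if stripped == closing_tag:
            if matched != invert:
                yield buffer
            closing_tag = None
        elif regex.search(stripped):
            matched = True


if __name__ == '__main__':
    sample = [
        '<a>\n', '  <b>\n', '    string to search for: stuff\n',
        '  </b>\n', '</a>\n',
        'plain text log message\n',
        '<x>\n', '    other text\n', '</x>\n',
    ]
    for doc in filter_documents(sample, 'stuff'):
        print(''.join(doc), end='')
```

Because the function is a generator over an iterable of lines, it could be fed an open file object or `sys.stdin` directly, and only one subdocument is held in memory at a time.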

Test Data File

I checked all three scripts against the following data file (data.xml) based on the example data given in your question:

<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>

Python Solution

Here's a simple Python script that does what you want:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
    invert_match = True
    # Drop the flag itself so that the regex becomes sys.argv[1]
    sys.argv.pop(1)

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not False:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're in between
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'

Here's how you run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

And here's the output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

And here's how you run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

And here's the output:

<x>
    unicode string: øæå
</x>

You can also pass -v or --invert-match to print only the non-matching documents, and omit the filename to read from stdin:

cat data.xml | python xmlgrep.py -v stuff
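The script above handles its options by inspecting sys.argv directly. As an alternative, the same command-line interface could be expressed with the standard argparse module. The following is only a sketch of that idea, not the way xmlgrep.py is written; the function name `parse_args` and the argument names `regex` and `file` are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse grep-style arguments: [-v] REGEX [FILE].

    Hypothetical helper; passing argv=None makes argparse read sys.argv[1:].
    """
    parser = argparse.ArgumentParser(
        description='Print XML subdocuments that contain a matching line.')
    parser.add_argument('-v', '--invert-match', action='store_true',
                        help='print subdocuments with no matching line')
    parser.add_argument('regex', help='regular expression to search for')
    # The file is optional; None signals that stdin should be read instead
    parser.add_argument('file', nargs='?', default=None,
                        help='input file (defaults to standard input)')
    return parser.parse_args(argv)


# Example with an explicit argument list instead of the real command line
args = parse_args(['-v', 'stuff'])
print(args.invert_match, args.regex, args.file)
```

With this approach a missing pattern produces a usage message rather than an IndexError, and the flag may appear before or after the positional arguments.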

Bash Solution

Here is a Bash implementation of the same basic algorithm. It uses flags to keep track of whether or not the current line belongs to an XML document and uses a temporary file as a buffer to store each XML document as it's being processed.

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument,
# and remove it again when the script exits
TEMPFILE="$(mktemp)"
trap 'rm -f "${TEMPFILE}"' EXIT

# Clear the internal field separator to preserve leading and trailing white-space
export IFS=''

# Iterate through all the lines of the file (-r keeps backslashes literal)
while read -r LINE; do

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern 
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file 
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here's how you could run the script to search for the string 'stuff':

bash xmlgrep.sh data.xml 'stuff'

And here's the corresponding output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here's how you might run the script to search for the string 'øæå':

bash xmlgrep.sh data.xml 'øæå'

And here's the corresponding output:

<x>
    unicode string: øæå
</x>

Awk Solution

Here is an awk solution - my awk isn't great though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints the document if it contains any lines matching the given regular expression. Here is the script:

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document
            if (MATCH==1) {
                # Loop by index: "for (i in BUFFER)" does not guarantee line order
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        # (the tag must be alone on its line, apart from white-space padding)
        if ($0 ~ /^[ \t]*<[a-z]+>[ \t]*$/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

Here is how you would call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

<x>
    unicode string: øæå
</x>