Search replace in XML file with sed or awk

awkregular expressionsedtext processingxml

So I have a task where by I have to manipulate an XML file through a bash shell script.

Here are the steps:

Query XML file for a value.
Take the value and cross reference it to find a new value from a list.
Replace the value of a different element with the new value.

Here is a sample of the XML with non-essential info removed:

<fmreq:fileManagementRequestDetail xmlns:fmreq="http://foobar.com/filemanagement">
      <fmreq:property>
         <fmreq:name>form_category_cd</fmreq:name>
         <fmreq:value>Memos</fmreq:value>
      </fmreq:property>
      <fmreq:property>
         <fmreq:name>object_name</fmreq:name>
         <fmreq:value>Correspondence</fmreq:value>
      </fmreq:property>
</fmreq:fileManagementRequestDetail>

I have to get the value from the value element under object_name, cross reference it, and then replace the value under the form_category_cd value element with the new value:

So if object_name -> value is Correspondence then the form_category_cd -> value might need to be YYZ.

Here's the rub, I can only use the tools available on our server as our operations group is restricting us to the tools at hand. It was a fight to get xmllint updated and then it got overruled. I'm on a version that does not support –xpath, which believe me is difficult on a good day. Also the version I have available doesn't support namespaces, so xmllint is out.

I've tried sed, but it seems to not like my regex even though every tester I try works fine.

Regex:

(<fmreq\:name>object_name<\/fmreq\:name>)(?:\n\s*)(<fmreq\:value>)(.*)(<\/fmreq\:value>)

I need to get group #3, but sed won't return it. Instead it returns the entire contents of the XML file.

sed -e 's/\(<fmreq\:name>object_name<\/fmreq\:name>\)\(?:\n\s*\)\(<fmreq\:value>\)\(.*\)\(<\/fmreq\:value>\)/\3/' < c3.xml

I'm not as familiar with awk / gawk, so I'm struggling to figure them out and this as well, but am open to them if a solution can be found.

Would love to have an awk / gawk solution just to make the boss happy since he's an old awk fan, but I'll take what I can get as I'm stumped.

Again I have to use the tools on hand and can't install anything new.

Best Answer

I think that there are a couple of problems in your sed command:

You don't use the -n option, so by default sed just prints every line of input to the output (possibly modified by a sed command).
You don't need the redirection < c3.xml, because sed recognizes the last argument as a filename.
sed is not very well suited for matches over multiple lines. See for example here.

The following seems to work on your example:

sed -n "/<fmreq:name>object_name<\/fmreq:name>/ {n;p}" c3.xml | sed "s/^\s*<fmreq:value>\(.*\)<\/fmreq:value>/\1/g"

Or, with only one sed invocation:

sed -n "/<fmreq:name>object_name<\/fmreq\:name>/ {n;s/^\s*<fmreq:value>\(.*\)<\/fmreq:value>/\1/g;p}" c3.xml

Breakdown of what this command does:

The option -n tells sed not to print the pattern space after it's finished processing the line. Consequently, you need to use the command p explicitely to do so.
/regex/ tells sed to execute the commands that follow only on the lines that match regex.
The sed command n replaces the content of the pattern space by the next line of input, which is the one containing the value you are interested in.
The sed command s/regex/replacement/ substitutes the first match of regex in the pattern space by replacement.
The sed command p prints the line.

Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea for all the scripts is the same: go through line-by-line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).

In the Python script I read all of the lines into a list and keep track of the list-index where the current XML subdocument begins so that I can print out the current subdocument when we reach the closing tag. I check each line for the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.

In the Bash script I use a temporary file as a buffer to store the current XML subdocument and wait until it's done being written before using grep to check if it contains a line matching the given regex.

The Awk script is similar to the Base script, but I use Awk array for the buffer instead of a file.

Test Data File

I checked both scripts against the following data file (data.xml) based on the example data given in your question:

<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>

Python Solution

Here's a simple Python script that does what you want:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
    invert_match = True
    sys.argv.pop(0)

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not False:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're inbetween
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'

Here's how you run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

And here's the output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

And here's how you run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

And here's the output:

<x>
    unicode string: øæå
</x>

You can also specify -v or --invert-match to search for non-matching documents, and work on stdin:

cat data.xml | python xmlgrep.py -v stuff

Bash Solution

Here is bash implementation of the same basic algorithm. It uses flags to keep track of whether or the current line belongs to an XML document and uses a temporary file as a buffer to store each XML document as it's being processed.

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument
TEMPFILE="$(mktemp)"

# Reset the internal field separator to preserver white-space
export IFS=''

# Iterate through all the lines of the file
while read LINE; do

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern 
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file 
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here's how you could run the script to search for the string 'stuff':

bash xmlgrep.sh data.xml 'stuff'

And here's the corresponding output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here's how you might run the script to search for the string 'øæå':

bash xmlgrep.sh data.xml 'øæå'

And here's the corresponding output:

<x>
    unicode string: øæå
</x>

Awk Solution

Here is an awk solution - my awk isn't great though, so it's pretty rough. It use the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document it prints it if it contains any lines matching the given regular expression. Here is the script:

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document
            if (MATCH==1) {
                for (i in BUFFER) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /<[a-z]+>/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

Here is how you would call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

<x>
    unicode string: øæå
</x>

Best Answer

Related Solutions

Parsing XML, JSON, and newer data file formats in UNIX using command line utilities

Filter xml documents matching certains ids

Summary

Test Data File

Python Solution

Bash Solution

Awk Solution

Related Question