Summary
I wrote a Python solution, a Bash solution, and an Awk solution. The idea for all the scripts is the same: go through line-by-line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).
In the Python script I read all of the lines into a list and keep track of the list-index where the current XML subdocument begins so that I can print out the current subdocument when we reach the closing tag. I check each line for the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.
In the Bash script I use a temporary file as a buffer to store the current XML subdocument and wait until it's done being written before using grep to check if it contains a line matching the given regex.
The Awk script is similar to the Bash script, but I use an Awk array for the buffer instead of a file.
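To make the shared idea concrete before the full scripts, here is a minimal sketch of the same state machine as a reusable Python generator. The function name grep_subdocuments and its parameters are my own for illustration (they don't appear in the scripts), and it assumes, like the scripts do, that opening and closing tags sit alone on their own lines.

```python
import re

# A bare opening tag on a line of its own, e.g. <a>
OPEN_TAG = re.compile(r'^<(\w+)>$')


def grep_subdocuments(lines, pattern, invert=False):
    """Yield each subdocument (as a list of lines) that contains a line
    matching `pattern`, or that contains none when `invert` is true."""
    regex = re.compile(pattern)
    buffer = None        # lines of the current subdocument, None when outside one
    closing_tag = None   # closing tag that ends the current subdocument
    matched = False      # has any line in the current subdocument matched?

    for line in lines:
        stripped = line.strip()
        if buffer is None:
            # Outside a subdocument: look for an opening tag
            opening = OPEN_TAG.match(stripped)
            if opening:
                buffer = [line]
                closing_tag = '</' + opening.group(1) + '>'
                matched = False
        else:
            # Inside a subdocument: buffer the line, then update the state
            buffer.append(line)
            if stripped == closing_tag:
                if matched != invert:
                    yield buffer
                buffer = None
            elif regex.search(stripped):
                matched = True
```

Because it only keeps the current subdocument in memory, something like `for doc in grep_subdocuments(open('data.xml'), 'stuff'): print(''.join(doc))` should work on large files as well.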
Test Data File
I checked all three scripts against the following data file (data.xml), based on the example data given in your question:
<a>
<b>
string to search for: stuff
</b>
</a>
in between xml documents there may be plain text log messages
<x>
unicode string: øæå
</x>
Python Solution
Here's a simple Python script that does what you want:
#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""
import sys
import re
invert_match = False
if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
    invert_match = True
    sys.argv.pop(0)
regex = sys.argv[1]
# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:
    # Read all of the data into a list
    lines = xmlfile.readlines()
    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False
    # Iterate through all the lines
    for index, line in enumerate(lines):
        # Remove trailing and leading white-space
        line = line.strip()
        # If we have a start_index then we're inside an XML document
        if start_index is not False:
            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False
            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True
        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're inbetween
        # XML subdocuments
        else:
            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:
                # Store the current line number
                start_index = index
                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'
Here's how you run the script to search for the string "stuff":
python xmlgrep.py stuff data.xml
And here's the output:
<a>
<b>
string to search for: stuff
</b>
</a>
And here's how you run the script to search for the string "øæå":
python xmlgrep.py øæå data.xml
And here's the output:
<x>
unicode string: øæå
</x>
You can also specify -v or --invert-match to search for non-matching documents, and work on stdin:
cat data.xml | python xmlgrep.py -v stuff
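The script above reads its options by inspecting sys.argv directly. As an alternative sketch (not what the script does), the same command-line interface could be declared with the standard argparse module. The function name parse_args and the help strings are my own assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse: [-v | --invert-match] REGEX [FILENAME]."""
    parser = argparse.ArgumentParser(
        description='Print XML subdocuments that contain a matching line.')
    parser.add_argument('-v', '--invert-match', action='store_true',
                        help='print the subdocuments that do NOT match')
    parser.add_argument('regex',
                        help='regular expression to search for')
    # nargs='?' makes the filename optional; None means "read stdin"
    parser.add_argument('filename', nargs='?',
                        help='input file (default: standard input)')
    return parser.parse_args(argv)
```

With this, `parse_args(['-v', 'stuff'])` gives invert_match=True and filename=None, so the caller can fall back to sys.stdin. One thing to keep in mind is that argparse treats a pattern starting with a dash as an option unless it is preceded by `--`.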
Bash Solution
Here is a Bash implementation of the same basic algorithm. It uses flags to keep track of whether or not the current line belongs to an XML document and uses a temporary file as a buffer to store each XML document as it's being processed.
#!/usr/bin/env bash
# xmlgrep.sh
# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"
# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""
# Use a temporary file to store the current XML subdocument
TEMPFILE="$(mktemp)"
# Reset the internal field separator to preserve white-space
export IFS=''
# Iterate through all the lines of the file
# (-r stops read from interpreting backslashes in the input)
while read -r LINE; do
    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then
        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"
        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""
            # Print the document if it contains the match pattern
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi
    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then
        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"
        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"
        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true
        # Start storing the subdocument in the temporary file
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"
# Remove the temporary file now that we're done with it
rm -f "${TEMPFILE}"
Here's how you could run the script to search for the string 'stuff':
bash xmlgrep.sh data.xml 'stuff'
And here's the corresponding output:
<a>
<b>
string to search for: stuff
</b>
</a>
Here's how you might run the script to search for the string 'øæå':
bash xmlgrep.sh data.xml 'øæå'
And here's the corresponding output:
<x>
unicode string: øæå
</x>
Awk Solution
Here is an awk solution. My awk isn't great though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document it prints it if it contains any lines matching the given regular expression. Here is the script:
#!/usr/bin/env gawk
# xmlgrep.awk
# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#
# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {
        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;
        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";
            # If there was a match then output the XML document
            if (MATCH==1) {
                # Loop by numeric index: the order of "for (i in BUFFER)"
                # is not guaranteed, so lines could come out shuffled
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {
        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /<[a-z]+>/) {
            # Set the XML_DOC flag
            XML_DOC=1;
            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;
            # Reset the match flag
            MATCH=0;
            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}
Here is how you would call it:
gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml
And here is the corresponding output:
<x>
unicode string: øæå
</x>
Best Answer
I think that there are a couple of problems in your sed command:
You don't use the -n option, so by default sed just prints every line of input to the output (possibly modified by a sed command).
You don't need the redirection < c3.xml, because sed recognizes the last argument as a filename.
sed is not very well suited for matches over multiple lines. See for example here.
The following seems to work on your example:
Or, with only one sed invocation:
Breakdown of what this command does:
The option -n tells sed not to print the pattern space after it's finished processing the line. Consequently, you need to use the command p explicitly to do so.
/regex/ tells sed to execute the commands that follow only on the lines that match regex.
The sed command n replaces the content of the pattern space by the next line of input, which is the one containing the value you are interested in.
The sed command s/regex/replacement/ substitutes the first match of regex in the pattern space by replacement.
The sed command p prints the line.
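If it helps to see that sequence of steps outside of sed, here is a rough Python emulation of the breakdown: find a line matching an address pattern, move on to the next line, substitute the first match in it, and print the result. The function name next_line_substitute and its parameters are hypothetical, and Python's re syntax is not identical to sed's regex or replacement syntax, so treat it as an illustration of the control flow only.

```python
import re


def next_line_substitute(lines, address, pattern, replacement):
    """For every line matching `address`, take the line that follows it,
    replace the first match of `pattern` with `replacement`, and collect
    the result. Nothing else is output (like running sed with -n)."""
    output = []
    iterator = iter(lines)
    for line in iterator:
        if re.search(address, line):
            # Like the sed command n: move on to the next input line
            following = next(iterator, None)
            if following is None:
                break  # no next line, so there is nothing left to process
            # Like s/regex/replacement/: only the first match is replaced
            output.append(re.sub(pattern, replacement, following, count=1))
    return output
```

Note that the line fetched after a match is consumed, so it is not itself tested against the address pattern, which mirrors how n advances the input in sed.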