Python – case sensitive substitution; same target ids

perl python sed

I am struggling to make a case-sensitive replacement in a text file. Below is a segment of my sed file, which I run as
sed -f file.sed < input.txt > output.txt

 s/\<code_229633_13\>/R77_08349T0/
 s/\<code_229633_138\>/R77_09738T0/
 s/\<code_230519_10\>/R77_04813T0/
 s/\<code_230519_1\>/R77_13591T0/
 s/\<code_230519_13\>/R77_05463T0/
 up to line 14521....

The code works well, but I also have cases where two or more TARGET ids (code_010512_23 and code_299097_0) map to the same REPLACEMENT id (R77_14520T0). In those cases I would like the output to be something like R77_14520T0.a and R77_14520T0.b (lines 1 and 2 below):

s/code_010512_23/R77_14520T0/ --> R77_14520T0.a
s/code_299097_0/R77_14520T0/ --> R77_14520T0.b
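
As a rough illustration of the rule described above (a sketch, not part of the original sed file), the Python snippet below adds .a, .b, ... only to replacement ids that are shared by several targets. The names `build_rules`, `apply_rules` and `pairs` are made up for this example, and it assumes no more than 26 targets share one replacement id.

```python
import re
import string
from collections import Counter


def build_rules(pairs):
    """Return {target: replacement}, suffixing replacements that repeat."""
    totals = Counter(repl for _, repl in pairs)
    seen = Counter()
    rules = {}
    for target, repl in pairs:
        if totals[repl] > 1:
            # Shared replacement id: append .a, .b, ... in order of appearance
            rules[target] = repl + "." + string.ascii_lowercase[seen[repl]]
            seen[repl] += 1
        else:
            rules[target] = repl
    return rules


def apply_rules(line, rules):
    # \w+ grabs each maximal run of word characters, so code_230519_1
    # can never match inside code_230519_13 (same effect as \< and \>)
    return re.sub(r"\w+", lambda m: rules.get(m.group(0), m.group(0)), line)


pairs = [
    ("code_230519_1", "R77_13591T0"),
    ("code_230519_13", "R77_05463T0"),
    ("code_010512_23", "R77_14520T0"),
    ("code_299097_0", "R77_14520T0"),
]
rules = build_rules(pairs)
print(apply_rules("x code_230519_13 y code_299097_0", rules))
```

Unlike a sed `s///` without the `g` flag, `re.sub` here replaces every matching id on a line, not only the first one.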

Furthermore, a more complex but similar case is when I have the following input file (input2.txt):

  ID=gene09464;Name=code_229633_13;isoforms=1           
  ID=mRNA10661;Parent=gene09464;Name=code_229633_13         
  ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1   1093    +
  ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1094    1873    +

  ID=gene09491;Name=code_229633_138;isoforms=1          
  ID=mRNA10690;Parent=gene09491;Name=code_229633_138            
  ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0 1   411 +

  ID=gene09513;Name=code_230519_10;isoforms=1           
  ID=mRNA10715;Parent=gene09513;Name=code_230519_10         
  ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  1   59  +
  ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  60  186 +

  ID=gene09511;Name=code_230519_1;isoforms=1            
  ID=mRNA10713;Parent=gene09511;Name=code_230519_1          
  ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1   1075    +
  ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1076    1128    +

  ID=gene09514;Name=code_230519_13;isoforms=1           
  ID=mRNA10716;Parent=gene09514;Name=code_230519_13         
  ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0  1   219 +

  ID=gene00865;Name=code_010512_23;isoforms=1           
  ID=mRNA00979;Parent=gene00865;Name=code_010512_23         
  ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0  1   143 +

  ID=gene14561;Name=code_299097_0;isoforms=2            
  ID=mRNA16419;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0   144 193 +
  ID=mRNA16420;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0   408 457 +

I need to apply the replacements in the same way as before, but only on the lines that contain the word "isoforms", in other words on lines 1, 6, 10, 15, 20, 24 and 28 and nowhere else in the text. The format of this input file is exactly as shown, with blank lines separating the "isoforms" blocks.

My desired output

 ID=gene09464;Name=R77_08349T0;isoforms=1           
 ID=mRNA10661;Parent=gene09464;Name=code_229633_13          
 ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1   1093    +
 ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1094    1873    +
 ID=exon26194;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1874    4065    +

 ID=gene09491;Name=R77_09738T0;isoforms=1           
 ID=mRNA10690;Parent=gene09491;Name=code_229633_138         
 ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0  1   411 +

 ID=gene09513;Name=R77_04813T0;isoforms=1            
 ID=mRNA10715;Parent=gene09513;Name=code_230519_10          
 ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1   59  +
 ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   60  186 +
 ID=exon26313;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   187 678 +
 ID=exon26314;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   679 1399    +
 ID=exon26315;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1400    1402    +

 ID=gene09511;Name=R77_13591T0;isoforms=1           
 ID=mRNA10713;Parent=gene09511;Name=code_230519_1           
 ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1   1075    +
 ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1076    1128    +

 ID=gene09514;Name=R77_05463T0;isoforms=1           
 ID=mRNA10716;Parent=gene09514;Name=code_230519_13          
 ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0   1   219 +

 ID=gene00865;Name=R77_14520T0.a;isoforms=1         
 ID=mRNA00979;Parent=gene00865;Name=code_010512_23          
 ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0   1   143 +

 ID=gene14561;Name=R77_14520T0.b;isoforms=2         
 ID=mRNA16419;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0    144 193 +
 ID=mRNA16420;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0    408 457 +

Best Answer

You can't really do this kind of thing with sed; it's just a text stream editor. Try this Perl scriptlet:

#!/usr/bin/env perl 

## Set the record separator to \n\n to
## read multiple lines as a single record
$/="\n\n";
## This array will contain all lines of the file
my @lines=<>;

## The list of suffixes (quoted so they are not treated as barewords)
@suffix=('a'..'z'); 

## For each line of the input file
foreach (@lines) {
    ## If the current line (lines are now the actual multiline records
    ## because we set $/ to consecutive newlines) is one we are interested in.
    if (/isoforms.*?Target=(\S+)/s) {
        ## Count how many records use each target
        $seen{$1}++;
    }
}
## Now that we have processed the entire file
## go back and print each line.
foreach (@lines) {
    ## If this line is one of the ones we're interested in
    if (/Name=(.+?);.*?isoforms=.*?Target=(\S+)/s) {
        $name=$1; $target=$2;
        ## This is needed so we know how many
        ## times we've seen this target so far.
        $newseen{$target}++;
        ## If this target exists more than once in the input file
        if ($seen{$target}>1) {
            ## Use the %newseen hash to choose the right letter.
            ## The -1 is needed because the first element of an
            ## array is 0, not 1.
            s/$name/$target.$suffix[$newseen{$target}-1]/;
        }
        else {
            s/$name/$target/;
        }
    }
    print;
}

Save the script above as foo.pl, make it executable (chmod a+x foo.pl) and run on your input file:

./foo.pl input.txt > output.txt
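
Since the question is also tagged python, here is a rough Python sketch of the same two-pass idea. It is not a line-for-line port of the Perl script: the function name `rename_genes` and the sample ids are made up for illustration, it assumes records are separated by blank lines, that the first Target= after the "isoforms" line is the replacement id, and that no more than 26 records share one target.

```python
import re
import string
from collections import Counter


def rename_genes(text):
    """Two-pass rename of the Name= field on 'isoforms' lines."""
    # Split into records on blank lines (which may contain spaces or tabs)
    records = re.split(r"\n[ \t]*\n", text.strip("\n"))
    pattern = re.compile(r"Name=([^;\s]+)(;isoforms=.*?Target=)(\S+)", re.S)

    # Pass 1: count how many records point at each target
    totals = Counter()
    for rec in records:
        m = pattern.search(rec)
        if m:
            totals[m.group(3)] += 1

    # Pass 2: rewrite the name, adding .a, .b, ... only for shared targets
    seen = Counter()
    out = []
    for rec in records:
        m = pattern.search(rec)
        if m:
            target = m.group(3)
            new_name = target
            if totals[target] > 1:
                new_name += "." + string.ascii_lowercase[seen[target]]
                seen[target] += 1
            # Replace only the Name= value on the 'isoforms' line
            rec = rec[:m.start(1)] + new_name + rec[m.end(1):]
        out.append(rec)
    return "\n\n".join(out) + "\n"


sample = (
    "ID=gene1;Name=code_1_13;isoforms=1\n"
    "ID=exon1;Parent=m1;Name=code_1_13;Target=R77_A\t1\t10\t+\n"
    "\n"
    "ID=gene2;Name=code_2_23;isoforms=1\n"
    "ID=exon2;Parent=m2;Name=code_2_23;Target=R77_B\t1\t143\t+\n"
    "\n"
    "ID=gene3;Name=code_3_0;isoforms=2\n"
    "ID=exon3;Parent=m3;Name=code_3_0;Target=R77_B\t144\t193\t+\n"
)
result = rename_genes(sample)
print(result, end="")
```

Note that this sketch rewrites the separators between records as plain empty lines, so whitespace-only separator lines in the input are not preserved.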

Related Solutions

Filter xml documents matching certain ids

Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea for all the scripts is the same: go through line-by-line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).

In the Python script I read all of the lines into a list and keep track of the list-index where the current XML subdocument begins so that I can print out the current subdocument when we reach the closing tag. I check each line for the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.

In the Bash script I use a temporary file as a buffer to store the current XML subdocument and wait until it's done being written before using grep to check if it contains a line matching the given regex.

The Awk script is similar to the Bash script, but I use an Awk array for the buffer instead of a file.
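
The buffer-and-flag idea shared by the scripts can be sketched as a small Python generator. This is only an illustration, not one of the scripts from this answer: the name `matching_subdocs` is made up, and like the scripts it assumes each subdocument starts with a bare `<tag>` line and ends with the matching `</tag>` line, with no nested element of the same name.

```python
import re


def matching_subdocs(lines, pattern, invert=False):
    """Yield each <tag>...</tag> block that contains a matching line."""
    regex = re.compile(pattern)
    buffer, closing_tag, matched = [], None, False
    for line in lines:
        stripped = line.strip()
        if closing_tag is None:
            # Outside a subdocument: look for an opening tag
            m = re.fullmatch(r"<(\w+)>", stripped)
            if m:
                closing_tag = "</" + m.group(1) + ">"
                buffer, matched = [line], False
            continue
        # Inside a subdocument: buffer the line
        buffer.append(line)
        if stripped == closing_tag:
            # End of the subdocument: emit it if wanted, then reset
            if matched != invert:
                yield "".join(buffer)
            closing_tag = None
        elif regex.search(stripped):
            matched = True


lines = [
    "<a>\n",
    "  <b>\n",
    "    string to search for: stuff\n",
    "  </b>\n",
    "</a>\n",
    "in between xml documents there may be plain text log messages\n",
    "<x>\n",
    "    unicode string: øæå\n",
    "</x>\n",
]
found = list(matching_subdocs(lines, "stuff"))
print("".join(found), end="")
```

Because only the current subdocument is buffered, this variant can be fed an open file object directly instead of a list holding the whole file.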

Test Data File

I checked the scripts against the following data file (data.xml), based on the example data given in your question:

<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>

Python Solution

Here's a simple Python script that does what you want:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

if len(sys.argv) > 1 and sys.argv[1] in ('-v', '--invert-match'):
    invert_match = True
    # Drop the flag so the regex is always sys.argv[1]
    sys.argv.pop(1)

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not False:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're inbetween
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'

Here's how you run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

And here's the output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

And here's how you run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

And here's the output:

<x>
    unicode string: øæå
</x>

You can also specify -v or --invert-match to search for non-matching documents, and work on stdin:

cat data.xml | python xmlgrep.py -v stuff

Bash Solution

Here is a Bash implementation of the same basic algorithm. It uses flags to keep track of whether or not the current line belongs to an XML document, and uses a temporary file as a buffer to store each XML document as it's being processed.

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

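# Use a temporary file to store the current XML subdocument
# and remove it again when the script exits
TEMPFILE="$(mktemp)"
trap 'rm -f "${TEMPFILE}"' EXIT

# Reset the internal field separator to preserve white-space
export IFS=''

# Iterate through all the lines of the file
# (-r keeps backslashes in the input intact)
while read -r LINE; do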

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern 
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file 
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here's how you could run the script to search for the string 'stuff':

bash xmlgrep.sh data.xml 'stuff'

And here's the corresponding output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here's how you might run the script to search for the string 'øæå':

bash xmlgrep.sh data.xml 'øæå'

And here's the corresponding output:

<x>
    unicode string: øæå
</x>

Awk Solution

Here is an awk solution. My awk isn't great though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts: it stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints it if it contains any lines matching the given regular expression. Here is the script:

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document.
            # Loop by index: "for (i in BUFFER)" does not guarantee order.
            if (MATCH==1) {
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /<[a-z]+>/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

Here is how you would call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

<x>
    unicode string: øæå
</x>
