SED XML – Adding Numerical Suffixes to Tag-Names to Distinguish XML Elements

sedxml

I have an XML file with multiple child elements which have the same tag-name, ex. <Name>Luigi</Name>, <Name>Mario</Name>, <Name>Peach</Name>. Here is a mock-up of what my input file looks like:

<!-- names.xml -->
<Names>
    <Name>Luigi</Name>
    <Name>Mario</Name>
    <Name>Peach</Name>
</Names>

When I throw this file into Excel for analysis it creates a new record for each Name element. This is awesome from a readability perspective, but it makes it difficult to discern if I have lots of duplicate data outside of the name fields.

What I want to do is rename the tags to Name1, Name2, Name3 so that they all appear on the same row when I import them into Excel. That way I'll be able to find records that are useless to me or that contain duplicates – without having to constantly look at the raw data.

In other words, I want a script or command which produces the following output:

<!-- names.xml -->
<Names>
    <Name1>Luigi</Name1>
    <Name2>Mario</Name2>
    <Name3>Peach</Name3>
</Names>

Is it possible to do this with a sed command or other Unix script?

Best Answer

Since you specifically asked for sed, here is a sed/bash script that should do what you want, provided that each <Name> element is opened and closed on the same line:

(IFS='';
n=0;
while read line; do
    if echo "${line}" | grep -Pq "<Name>\w+</Name>"; then
        ((n++));
        echo "${line}" | sed "s/<Name>\(\w\+\)<\/Name>/<Name${n}>\1<\/Name${n}>/";
    else
        echo "${line}";
    fi;
done) < names.xml

I tested it with this input file:

<!-- names.xml -->
<Names>
    <Name>Luigi</Name>
    <Name>Mario</Name>
    <Name>Peach</Name>
</Names>

And it produced the following output:

<Names>
    <Name1>Luigi</Name1>
    <Name2>Mario</Name2>
    <Name3>Peach</Name3>
</Names>

That said, this seems like a good candidate for a language with an XML parsing library. Here is a Python script that does what you want:

#!/usr/bin/env python2
# -*- encoding: ascii -*-

# add_suffix.py

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Update the XML tree
suffix = 0
for name in root.iter("Name"):
    suffix += 1
    name.tag += str(suffix)

# Write out the updated data
tree.write(sys.argv[2])

Run it like this:

python add_suffix.py names.xml new_names.xml

Related Solutions

How to sort XML elements in-place

I'd tackle it using perl and XML::Twig.

perl has a sort function, which allows you to specify arbitary criteria for comparing a sequence of values. As long as your function returns positive, negative or zero based on relative ordering.

So this is where the magic happens - we specify a sort criteria that:

Compares based on node name (tag)
Then compares based on attribute existence
Then compares on attribute value.

It needs to do this recursively across your structure to sort sub-nodes too.

So:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

my $xml = XML::Twig -> new -> parsefile ('sample.xml');

sub compare_elements {
   ## perl sort uses $a and $b to compare. 
   ## in this case, it's nodes we expect;

   #tag is the node name. 
   my $compare_by_tag = $a -> tag cmp $b -> tag;
   #conditional return - this works because cmp returns zero
   #if the values are the same.
   return $compare_by_tag if $compare_by_tag; 

   #bit more complicated - extract all the attributes of both a and b, and then compare them sequentially:
   #This is to handle case where you've got mismatched attributes.
   #this may be irrelevant based on your input. 
   my %all_atts;
   foreach my $key ( keys %{$a->atts}, keys %{$b->atts}) { 
      $all_atts{$key}++;
   }
   #iterate all the attributes we've seen - in either element. 
   foreach my $key_to_compare ( sort keys %all_atts ) {

      #test if this attribute exists. If it doesn't in one, but does in the other, then that gets sorted to the top. 
      my $exists = ($a -> att($key_to_compare) ? 1 : 0) <=> ($b -> att($key_to_compare) ? 1 : 0);
      return $exists if $exists;

      #attribute exists in both - extract value, and compare them alphanumerically. 
      my $comparison =  $a -> att($key_to_compare) cmp $b -> att($key_to_compare);
      return $comparison if $comparison;
   }
   #we have fallen through all our comparisons, we therefore assume the nodes are the same and return zero. 
   return 0;
}

#recursive sort - traverses to the lowest node in the tree first, and then sorts that, before
#working back up. 
sub sort_children {
   my ( $node ) = @_;
   foreach my $child ( $node -> children ) { 
      #sort this child if is has child nodes. 
      if ( $child -> children ) { 
         sort_children ( $child )
      }     
   }  

   #iterate each of the child nodes of this one, sorting based on above criteria
      foreach my $element ( sort { compare_elements } $node -> children ) {

         #cut everything, then append to the end.
         #because we've ordered these, then this will work as a reorder operation. 
         $element -> cut;
         $element -> paste ( last_child => $node );
      }
}

#set off recursive sort. 
sort_children ( $xml -> root );

#set output formatting. indented_a implicitly sorts attributes. 
$xml -> set_pretty_print ( 'indented_a');
$xml -> print;

Which given your input, outputs:

<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ChangeListManager">
    <ignored path=".idea/dataSources.local.xml" />
    <ignored path=".idea/workspace.xml" />
    <ignored path="tilde.iws" />
    <option
        name="EXCLUDED_CONVERTED_TO_IGNORED"
        value="true"
    />
    <option
        name="HIGHLIGHT_CONFLICTS"
        value="true"
    />
    <option
        name="HIGHLIGHT_NON_ACTIVE_CHANGELIST"
        value="false"
    />
    <option
        name="LAST_RESOLUTION"
        value="IGNORE"
    />
    <option
        name="SHOW_DIALOG"
        value="false"
    />
    <option
        name="TRACKING_ENABLED"
        value="true"
    />
  </component>
  <component name="ToolWindowManager">
    <editor active="false" />
    <frame
        extended-state="0"
        height="1179"
        width="958"
        x="1201"
        y="380"
    />
    <layout>
      <window_info
          active="false"
          anchor="bottom"
          auto_hide="false"
          content_ui="tabs"
          id="TODO"
          internal_type="DOCKED"
          order="6"
          show_stripe_button="true"
          sideWeight="0.5"
          side_tool="false"
          type="DOCKED"
          visible="false"
          weight="0.33"
      />
      <window_info
          active="false"
          anchor="left"
          auto_hide="false"
          content_ui="tabs"
          id="Palette&#x09;"
          internal_type="DOCKED"
          order="2"
          show_stripe_button="true"
          sideWeight="0.5"
          side_tool="false"
          type="DOCKED"
          visible="false"
          weight="0.33"
      />
    </layout>
  </component>
</project>

No matter what order the various child nodes fall in.

I personally like indented_a because it wraps attributes to new lines, and I think that's clearer. But indented output format could do the same trick.

Best Answer

Related Solutions

How to sort XML elements in-place

Related Question