How to Convert Lines to Columns with Awk

Tags: awk, bash, columns, sed, xml

I have the following sample output:

<HARDWARE>
    <NAME>WIN1</NAME>
    <OS>Windows 7</OS>
    <IP>1.2.3.4</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
<HARDWARE>
    <NAME>WIN2</NAME>
    <OS>Windows 8</OS>
    <IP>10.20.30.40</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>

What is the best way to parse it so it will look like:

WIN1    Windows 7    1.2.3.4     contoso.com
WIN2    Windows 8    10.20.30.40 contoso.com

I'm looking for a solution that uses standard tools such as awk, sed, etc.

Best Answer

Please don't use awk, sed, etc. for this. They cannot handle XML properly. XML allows arbitrary whitespace, linefeeds, unary (self-closing) tags and so on, which means regular expressions aren't robust: they break messily after a perfectly valid change to the XML further down the line.
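To make that fragility concrete, here is a minimal sketch in core Perl. The helper name naive_names and the id attribute are invented for illustration; the point is only that a line-based regex which works on the layout from the question finds nothing once the same data is laid out differently, even though the XML is still well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive extractor: expects <NAME>...</NAME> on a single line.
sub naive_names {
    my ($xml) = @_;
    my @names;
    for my $line ( split /\n/, $xml ) {
        push @names, $1 if $line =~ m{<NAME>(.*?)</NAME>};
    }
    return @names;
}

# Layout as shown in the question.
my $original = <<'XML';
<HARDWARE>
    <NAME>WIN1</NAME>
</HARDWARE>
XML

# Same data, still well-formed: an (invented) attribute was added and the
# element text now sits on its own line.
my $reformatted = <<'XML';
<HARDWARE>
    <NAME id="1">
        WIN1
    </NAME>
</HARDWARE>
XML

my @before = naive_names($original);
my @after  = naive_names($reformatted);

printf "original layout:    %d match(es)\n", scalar @before;   # 1 match
printf "reformatted layout: %d match(es)\n", scalar @after;    # 0 matches
```

A parser treats both inputs as the same NAME element (apart from the surrounding whitespace in the text), which is why the rest of this answer uses one.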

The way to handle XML is with a parser. xmlstarlet is one commonly used on Linux. Because I haven't seen it suggested yet, I'd use Perl. E.g.:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

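# Parse the whole file into an in-memory tree.
my $twig = XML::Twig -> parsefile ( 'your_xml_file.xml' );
# Visit every HARDWARE element, wherever it sits in the tree.
foreach my $HW ( $twig -> findnodes ( '//HARDWARE' ) ) {
    # One output line per element: the children's text, tab-separated.
    print join ( "\t", map { $_ -> text } $HW -> children ), "\n";
}

This will:

  • Parse the XML.
  • Iterate over the HARDWARE elements.
  • Extract the text from the children.
  • Print that.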

You could extend it a little to allow you to handle e.g. different field sets/ordering:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

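# Fields to print, in this order.
my @fields_to_show = qw ( OS NAME );

my $twig = XML::Twig -> parsefile ( 'your_xml_file.xml' );
foreach my $HW ( $twig -> findnodes ( '//HARDWARE' ) ) {
    # Map each child's tag name to its text, e.g. NAME => 'WIN1'.
    my %fields = map { $_ -> tag => $_ -> text } $HW -> children;
    # Hash slice: pull out only the requested fields, in the requested order.
    print join ( "\t", @fields{@fields_to_show} ), "\n";
}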

For each HARDWARE element it builds a hash (associative array) called %fields that looks like this:

$VAR1 = {
          'OS' => 'Windows 7',
          'NAME' => 'WIN1',
          'DOMAIN' => 'contoso.com',
          'IP' => '1.2.3.4'
        };

The hash slice @fields{@fields_to_show} then selects which fields to display, and in which order.

This prints:

Windows 7   WIN1
Windows 8   WIN2

NB: I also had to 'fix' your XML, because without a single root element it isn't well-formed. Other answers have mentioned this. The XML spec is quite strict: broken XML should be rejected. So it's actually quite bad form to "fix" XML, and normally I'd suggest hitting whoever generated it around the head with a copy of the XML spec.
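If the generator really can't be changed, one way to apply that kind of 'fix' inside the script rather than by hand is sketched below. The helper name rows_from_fragments and the wrapper element name inventory are made up for illustration, and the sketch assumes the input carries no XML declaration or DOCTYPE of its own (wrapping would break those).

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical helper: takes the root-less fragments as one string and
# returns one tab-separated line per HARDWARE element.
sub rows_from_fragments {
    my ($fragments) = @_;

    # 'inventory' is an invented wrapper; any valid element name would do.
    my $twig = XML::Twig->new;
    $twig->parse("<inventory>$fragments</inventory>");

    my @rows;
    for my $hw ( $twig->findnodes('//HARDWARE') ) {
        push @rows, join "\t", map { $_->text } $hw->children;
    }
    return @rows;
}

# Sample data copied from the question; in practice you would slurp the file.
my $sample = <<'XML';
<HARDWARE>
    <NAME>WIN1</NAME>
    <OS>Windows 7</OS>
    <IP>1.2.3.4</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
<HARDWARE>
    <NAME>WIN2</NAME>
    <OS>Windows 8</OS>
    <IP>10.20.30.40</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
XML

print "$_\n" for rows_from_fragments($sample);
```

Run against the sample, this should give the two tab-separated rows asked for in the question. It is still a workaround: the parser only accepts the input because the script papers over the missing root.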
