How to Convert Lines to Columns with Awk

awk, bash, columns, sed, xml

I have the following sample output:

<HARDWARE>
    <NAME>WIN1</NAME>
    <OS>Windows 7</OS>
    <IP>1.2.3.4</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
<HARDWARE>
    <NAME>WIN2</NAME>
    <OS>Windows 8</OS>
    <IP>10.20.30.40</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>

What is the best way to parse it so it will look like:

WIN1    Windows 7    1.2.3.4     contoso.com
WIN2    Windows 8    10.20.30.40 contoso.com

I'm looking for a solution that uses standard tools such as awk, sed, etc.

Best Answer

Please don't use awk, sed, etc. for this. They cannot handle XML properly. XML permits arbitrary whitespace, line feeds, unary (self-closing) tags and so on, which means regular expressions aren't robust: they break messily after a perfectly valid change to the XML further down the line.

The way to handle XML is with a parser. xmlstarlet is one commonly used on Linux. Because I haven't seen it suggested yet, I'd use Perl. E.g.:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

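# Parse the whole file into a twig (an in-memory tree).
my $twig = XML::Twig -> new -> parsefile ('your_xml_file.xml');
# Print the text of each HARDWARE element's children, tab-separated.
foreach my $HW ( $twig -> findnodes ( '//HARDWARE' ) ) {
    print join ( "\t", map { $_ -> text } $HW -> children ),"\n";
}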

Parse the XML.
Iterate over the HARDWARE elements.
Extract the text from the children.
Print it.

You could extend it a little to allow you to handle e.g. different field sets/ordering:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

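# Fields to print, in output order.
my @fields_to_show = qw ( OS NAME );

my $twig = XML::Twig -> new -> parsefile ( 'your_filename.xml' );
foreach my $HW ( $twig -> findnodes ( '//HARDWARE' ) ) {
    # Map each child's tag name to its text, e.g. OS => 'Windows 7'.
    my %fields =  map { $_ -> tag => $_ -> text } $HW -> children;
    # A hash slice picks out just the requested fields, in order.
    print join ("\t", @fields{@fields_to_show}),"\n";
}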

It generates a hash (associative array) called %fields that looks like this for each HARDWARE element:

$VAR1 = {
          'OS' => 'Windows 7',
          'NAME' => 'WIN1',
          'DOMAIN' => 'contoso.com',
          'IP' => '1.2.3.4'
        };

And then we use @fields_to_show to specify which to display and in which order.

So this prints:

Windows 7   WIN1
Windows 8   WIN2

NB: I also had to 'fix' your XML, because without a single root tag it's invalid. Other answers have mentioned this. The XML spec is quite strict: broken XML should be rejected. So it's actually quite bad form to "fix" XML, and normally I'd suggest hitting whoever generated it around the head with a copy of the XML spec.
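
The root-tag fix mentioned above can be sketched in shell. This is only a minimal sketch: it assumes the fragments carry no XML declaration of their own, and the input file name hardware_fragments.xml and the root tag INVENTORY are made-up names for illustration. The result is written to your_xml_file.xml, the name the first script reads.

```shell
# Hypothetical input: the rootless <HARDWARE> fragments from the question.
cat > hardware_fragments.xml <<'EOF'
<HARDWARE>
    <NAME>WIN1</NAME>
    <OS>Windows 7</OS>
    <IP>1.2.3.4</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
<HARDWARE>
    <NAME>WIN2</NAME>
    <OS>Windows 8</OS>
    <IP>10.20.30.40</IP>
    <DOMAIN>contoso.com</DOMAIN>
</HARDWARE>
EOF

# Wrap the fragments in one root element so a parser will accept the file.
# INVENTORY is an arbitrary name chosen for this sketch.
{
    printf '<INVENTORY>\n'
    cat hardware_fragments.xml
    printf '</INVENTORY>\n'
} > your_xml_file.xml
```

The //HARDWARE XPath in the scripts above matches the elements at any depth, so the extra wrapper does not change their output.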

Related Solutions

Lum – How can sed output be formatted like printf’s formatted printing

This uses the extended regex syntax (-r), which clears up a lot of the clutter. Also, because you already know some of the field values, you don't actually need to back-reference them, again reducing clutter (and overhead).

& is a special replacement value: it holds the entire matched pattern. Using & again reduces clutter. As it is not a back-reference, it has significantly less overhead.

I've used ( +) vs. ( *). The + assumes that there is at least one space between input fields. Just change it to * if that is not the case.

EXPL=
dom=oracle
typ=hard
itm=nproc
val=666

echo "oracle   hard   nproc    131072" |
  sed -r "s/^$dom( +)$typ( +)$itm( +).*/$EXPL#&\n$dom\1$typ\2$itm\3$val/"

output

#oracle   hard   nproc    131072
oracle   hard   nproc    666
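
To see the two replacement features in isolation, here is a small sketch that uses the same sample line. The variable names commented and replaced are made up for illustration, and -E is used as the portable spelling of -r.

```shell
line='oracle   hard   nproc    131072'

# & stands for the whole matched text, so this prefixes the line with '#'.
commented=$(printf '%s\n' "$line" | sed 's/^oracle.*/#&/')

# \1 stands for the first parenthesised group: everything up to the value,
# original spacing included. Only the value after it is replaced.
replaced=$(printf '%s\n' "$line" | sed -E 's/^(oracle +hard +nproc +).*/\1666/')

printf '%s\n%s\n' "$commented" "$replaced"
```

This prints the commented-out original followed by the line with the new value, the same two lines as the output above, without the GNU-specific \n in the replacement.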

Bash – Working with columns – awk and sed

Just use awk directly:

awk '/\/g/ {
        gsub(/\./, "", $2)
        gsub(/../, "&:", $2)
        sub(/:$/, "", $2) 
        print $2,$3
}'

With this solution you need neither grep nor sed.
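
The three substitutions are easier to follow on a single value. The original input is not shown here, so the sketch below assumes a hypothetical dotted value, 0011.2233.4455, and applies the same gsub/sub calls to the first field.

```shell
# Hypothetical sample value; the real input format is an assumption.
formatted=$(printf '%s\n' '0011.2233.4455' | awk '{
    gsub(/\./, "", $1)     # drop the dots:          001122334455
    gsub(/../, "&:", $1)   # colon after each pair:  00:11:22:33:44:55:
    sub(/:$/, "", $1)      # strip trailing colon:   00:11:22:33:44:55
    print $1
}')

printf '%s\n' "$formatted"
```

As in sed, & in the awk replacement string stands for the matched text, which is what lets one gsub call insert a separator after every pair of characters.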
