Bash / sed: stripping all formatting (line breaks and whitespace) from a report text file while masking out certain pieces


I am working on a project in which I need to remove all formatting from a text file, including whitespace and line breaks, then replace the colons that act as field delimiters with equals signs and join the fields with pipes. I've made some headway, but I cannot find a way to mask out the parts that need to be ignored. I am new to sed and only a novice at Bash scripting, and I am not entirely sure sed is the right tool for the job (maybe vi? I typically use nano). The file I am trying to format looks like this:

== LUN mysql05-dbdat02 ==

  LUNName:                        mysql05-dbdat02
  CollectionStartTime:            2012-09-20T15:43:03-04:00
  CollectionEndTime:              2012-09-20T15:43:34-04:00
  Capacity
    CurrentCapacity:              512
  IOOperations
    Reads:                        100
    Writes:                       0
    ReadsPerSecond:               0.000000
    WritesPerSecond:              0.000000
    ReadMBPerSecond:              0.000
    WriteMBPerSecond:             0.000
    TotalMBPerSecond:             0.000
    NonOptimizedIOPerSecond:      0.000000
    CacheHitPercentage:           0.000
  PerformanceMetrics
    TotalIOsPerSecond:            0.000
    ReadIOsPerSecond:             0.000
    WriteIOsPerSecond:            0.000
    TotalMBPerSecond:             0.000
    ReadMBPerSecond:              0.000
    WriteMBPerSecond:             0.000
  Performance

== LUN mysql05-dbdat02 ==

  LUNName:                        mysql05-dbdat02
  CollectionStartTime:            2012-09-20T15:43:03-04:00
  CollectionEndTime:              2012-09-20T15:43:34-04:00
  Capacity
    CurrentCapacity:              512
  IOOperations
    Reads:                        100
    Writes:                       0
    ReadsPerSecond:               0.000000
    WritesPerSecond:              0.000000
    ReadMBPerSecond:              0.000
    WriteMBPerSecond:             0.000
    TotalMBPerSecond:             0.000
    NonOptimizedIOPerSecond:      0.000000
    CacheHitPercentage:           0.000
  PerformanceMetrics
    TotalIOsPerSecond:            0.000
    ReadIOsPerSecond:             0.000
    WriteIOsPerSecond:            0.000
    TotalMBPerSecond:             0.000
    ReadMBPerSecond:              0.000
    WriteMBPerSecond:             0.000
  Performance

and the output needs to be something like this:

cm-data-unity01|LUNNam=cm-data-unity01|CollectionStartTim=2012-09-20T15:43:03-04:00|CollectionEndTim=2012-09-20T15:43:34-04:00|Capacity|CurrentCapacit=2048|IOOperations|Read=10|Write=90|ReadsPerSecon=8.000000|WritesPerSecon=76.000000|ReadMBPerSecon=0.430|WriteMBPerSecon=0.542|TotalMBPerSecon=0.973|NonOptimizedIOPerSecon=85.000000|CacheHitPercentag=0.000|PerformanceMetrics|TotalIOsPerSecon=84.000|ReadIOsPerSecon=8.000|WriteIOsPerSecon=76.000|TotalMBPerSecon=0.973|ReadMBPerSecon=0.430|WriteMBPerSecon=0.542|Performance|

that is, all on one line.

I have written a very simple Bash script to format it:

#!/bin/bash
# Author Christopher George Bollinger
# Comments: This script will modify the snippet.txt file.
# This script is meant to, first, take a specific bit of unformatted data and remove all line breaks and non-printable characters.
#
# Following this, the script is to replace any appropriate colons (those being used as delimiters) with the equals (=) character.

echo "This script will remove line breaks, remove non-printable characters, and will replace colons used as field delimiters with the equals '(=)' character."
cp snippet.txt snippetwork.txt

RmLB ()
{
tr -d '\n' < snippetwork.txt > snippetwork1.txt

}

RmNonPrint ()
{
tr -cd "[:print:]" < snippetwork1.txt > snippetwork2.txt

}

RplcW ()
{
sed 's/: /=/g' snippetwork2.txt > snippetwork3.txt

}

RmWtSpc ()
{
tr -s ' ' '|' < snippetwork3.txt > snippetgood.txt
sed 'd/(?:[a-z]=) /'
}

QuChek ()
{
cat snippetgood.txt
read -p "Is this satisfactory? (Y/n)" Choice
case $Choice in
    Y|y)
    mv snippetgood.txt snippet.txt
    rm -f snippetwork*
    rm -f snippetgood.txt
    ;;
    N|n)
    exit
    ;;
    *)
    echo "Invalid Input."
    ;;
esac
}

read -p "Would you like to begin? (Y/n)" YorN

case $YorN in
    Y|y)
    RmLB
    RmNonPrint
    RplcW
    RmWtSpc
    QuChek
    ;;
    N|n)
    exit
    ;;
    *)
    echo "Invalid Selection"
    ;;
esac

The script runs, but the output is not quite right. It gives:

==|LUN|mysql05-dbdat02|==|LUNName=|mysql05-dbdat02|CollectionStartTime=|2012-09-20T15:43:03-04:00|CollectionEndTime=|2012-09-20T15:43:34-04:00|Capacity|CurrentCapacity=|512|IOOperations|Reads=|100|Writes=|0|ReadsPerSecond=|0.000000|WritesPerSecond=|0.000000|ReadMBPerSecond=|0.000|WriteMBPerSecond=|0.000|TotalMBPerSecond=|0.000|NonOptimizedIOPerSecond=|0.000000|CacheHitPercentage=|0.000|PerformanceMetrics|TotalIOsPerSecond=|0.000|ReadIOsPerSecond=|0.000|WriteIOsPerSecond=|0.000|TotalMBPerSecond=|0.000|ReadMBPerSecond=|0.000|WriteMBPerSecond=|0.000|Performance|==|LUN|mysql05-dbdat02|==|LUNName=|mysql05-dbdat02|CollectionStartTime=|2012-09-20T15:43:03-04:00|CollectionEndTime=|2012-09-20T15:43:34-04:00|Capacity|CurrentCapacity=|512|IOOperations|Reads=|100|Writes=|0|ReadsPerSecond=|0.000000|WritesPerSecond=|0.000000|ReadMBPerSecond=|0.000|WriteMBPerSecond=|0.000|TotalMBPerSecond=|0.000|NonOptimizedIOPerSecond=|0.000000|CacheHitPercentage=|0.000|PerformanceMetrics|TotalIOsPerSecond=|0.000|ReadIOsPerSecond=|0.000|WriteIOsPerSecond=|0.000|TotalMBPerSecond=|0.000|ReadMBPerSecond=|0.000|WriteMBPerSecond=|0.000|Performance|

The problem is the pipe that appears after each equals sign. If anyone could point me in the right direction on getting this right, or even to an online resource for some clarification, I would be immensely grateful.

The funny thing is that, while the immediate request is to format the data like the example above, the end goal is to feed it into a Unix CLI graphing tool (my guess is gnuplot). From what I understand, gnuplot requires its input to be formatted in columns. As mentioned, this is new territory for me, and I would greatly appreciate any advice.

Best Answer

I am not quite sure what you're trying to do. Using your first input file, I get this output:

LUNName=mysql05-dbdat02|CollectionStartTime=2012-09-20T15:43:03-04:00|CollectionEndTime=2012-09-20T15:43:34-04:00|Capacity|CurrentCapacity=512|IOOperations|Reads=100|Writes=0|ReadsPerSecond=0.000000|WritesPerSecond=0.000000|ReadMBPerSecond=0.000|WriteMBPerSecond=0.000|TotalMBPerSecond=0.000|NonOptimizedIOPerSecond=0.000000|CacheHitPercentage=0.000|PerformanceMetrics|TotalIOsPerSecond=0.000|ReadIOsPerSecond=0.000|WriteIOsPerSecond=0.000|TotalMBPerSecond=0.000|ReadMBPerSecond=0.000|WriteMBPerSecond=0.000|Performance|

With this Perl one-liner:

perl -pe 's/\n/|/;s/\s*//g; s/:/=/; END{print "\n"}' file
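
The pipes after the equals signs in your script come from the order of operations: sed 's/: /=/g' removes only one of the padding spaces after each colon, and tr -s ' ' '|' then squeezes the remaining spaces into a pipe. Below is a minimal sketch of your pipeline with that fixed, assuming the report arrives on standard input. The name flatten_report is hypothetical, and turning newlines into spaces (rather than deleting them) is my change so that adjacent lines are not glued together.

```shell
#!/bin/bash
# flatten_report is a hypothetical helper name, not part of the original script.
# Reads a report on stdin and writes one pipe-delimited line to stdout.
flatten_report() {
    tr '\n' ' ' |                   # newlines become spaces, so lines stay separated
    tr -cd '[:print:]' |            # drop non-printable characters (spaces survive)
    sed 's/:  */=/g; s/^ *//' |     # "key:   value" -> "key=value"; trim leading padding
    tr -s ' ' '|'                   # each remaining run of spaces becomes one pipe
}

# Usage example on a few lines shaped like the report in the question.
flatten_report <<'EOF'
  LUNName:                        mysql05-dbdat02
  CollectionStartTime:            2012-09-20T15:43:03-04:00
  Capacity
    CurrentCapacity:              512
EOF
echo
```

The pattern ':  *' only matches a colon that is followed by at least one space, so the colons inside the timestamps are left alone. For the four sample lines this should print LUNName=mysql05-dbdat02|CollectionStartTime=2012-09-20T15:43:03-04:00|Capacity|CurrentCapacity=512| on a single line.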

You could also do it with sed and tr (this assumes GNU sed, where \s matches whitespace):

sed -r 's/\s*//g; s/:/=/;' file | tr '\n' '|'
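
Both commands treat every line alike, so with the full report the == LUN ... == headers and the blank lines end up in the output as well, and all records land on one line. If you want one line per LUN, led by the LUN name as in your sample output, awk is one option. This is only a sketch: report_to_records is a hypothetical name, and it assumes every record starts with a header of the form == LUN name ==.

```shell
#!/bin/bash
# report_to_records is a hypothetical helper name.
# Emits one pipe-delimited line per "== LUN name ==" block; reads the files
# given as arguments, or stdin when there are none.
report_to_records() {
    awk '
        /^== LUN / {                        # a header starts a new record
            if (line != "") print line "|"  # flush the previous record
            line = $3                       # the LUN name leads the record
            next
        }
        NF == 0 { next }                    # skip blank lines
        {
            gsub(/^[ \t]+|[ \t]+$/, "")     # trim indentation and trailing blanks
            sub(/:[ \t]+/, "=")             # only the first "key: value" colon
            line = (line == "" ? $0 : line "|" $0)
        }
        END { if (line != "") print line "|" }
    ' "$@"
}

# Usage (snippet.txt is the file name from the question; the output name is made up):
#   report_to_records snippet.txt > snippet.flat
```

Because sub replaces only the first colon that is followed by whitespace, the colons inside the timestamps are preserved, just as with the unanchored s/:/=/ above.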