Grepping for a block of text with parts that can be optional

awkgrepsedtext processing

I have multiple entries that describes an event in a very large log file, say A.log. I would like to do two things with the event entries in the log file:

  1. Count the number of occurrences of each such entry.(This is not a mandatory requirement but would be nice to have.)
  2. Extract the actual entries in a separate file and study them later on.

A typical event entry would look like the following and will have other texts between them. So in the example below there are two event entries, the first one containing two DataChangeEntry payload and second one containing one DataChangeEntry payload.

    Data control raising event :DataControl@263c015d[[
    #### DataChangeEvent #### on [DataControl name=PatternMatch_LegendTimeAxis, binding=.dynamicRegion1.                         beam_project_PatternMatch_dashboard_LegendTimeAxis_taskflow_LegendTimeAxis_beamDashboardLegendTimeAxisPageDef_beam_project_PatternMatch_dashboard_LegendTimeAxis_taskflow_LegendTimeAxis_beamDashboardLegendTimeAxis_xml_ps_taskflowid.dynamicRegion58.                                                                                                                         beam_project_PatternMatch_view_LegendTimeAxis_taskflow_LegendTimeAxis_beamVizLegendTimeAxisPageDef_beam_project_PatternMatch_view_LegendTimeAxis_taskflow_LegendTimeAxis_beamVizLegendTimeAxis_xml_ps_taskflowid.QueryIterator]
    Filter/Collection Id : 0
    Collection Level     : 0
    Sequence Id             : 616
    ViewSetId            : PatternMatch.LegendTimeAxis_V1_0_SN49
    ==== DataChangeEntry (#1)
    ChangeType           : UPDATE
    KeyPath              : [2014-06-26 06:15:00.0, 0]
    AttributeNames       : [DATAOBJECT_CREATED, COUNTX, QueryName]
    AttributeValues      : [2014-06-26 06:15:00.0, 11, StrAvgCallWaitTimeGreaterThanThreshold]
    AttributeTypes       : [java.sql.Timestamp, java.lang.Integer, java.lang.String,  ]
    ==== DataChangeEntry (#2)
    ChangeType           : UPDATE
    KeyPath              : [2014-06-26 06:15:00.0, 0]
    AttributeNames       : [DATAOBJECT_CREATED, COUNTX, QueryName]
    AttributeValues      : [2014-06-26 06:15:00.0, 9, AverageCallWaitingTimeGreateThanThreshold]
    AttributeTypes       : [java.sql.Timestamp, java.lang.Integer, java.lang.String,  ]

    ]]

someother non useful text
spanning multiple lines 

 Data control raising event :DataControl@263c015d[[
    #### DataChangeEvent #### on [DataControl name=PatternMatch_LegendTimeAxis, binding=.dynamicRegion1.                         beam_project_PatternMatch_dashboard_LegendTimeAxis_taskflow_LegendTimeAxis_beamDashboardLegendTimeAxisPageDef_beam_project_PatternMatch_dashboard_LegendTimeAxis_taskflow_LegendTimeAxis_beamDashboardLegendTimeAxis_xml_ps_taskflowid.dynamicRegion58.                                                                                                                         beam_project_PatternMatch_view_LegendTimeAxis_taskflow_LegendTimeAxis_beamVizLegendTimeAxisPageDef_beam_project_PatternMatch_view_LegendTimeAxis_taskflow_LegendTimeAxis_beamVizLegendTimeAxis_xml_ps_taskflowid.QueryIterator]
    Filter/Collection Id : 0
    Collection Level     : 0
    Sequence Id             : 616
    ViewSetId            : PatternMatch.LegendTimeAxis_V1_0_SN49
    ==== DataChangeEntry (#1)
    ChangeType           : UPDATE
    KeyPath              : [2014-06-26 06:15:00.0, 0]
    AttributeNames       : [DATAOBJECT_CREATED, COUNTX, QueryName]
    AttributeValues      : [2014-06-26 06:15:00.0, 11, StrAvgCallWaitTimeGreaterThanThreshold]
    AttributeTypes       : [java.sql.Timestamp, java.lang.Integer, java.lang.String,  ]

    ]]

Please note that the the number of ==== DataChangeEntry lines in an event entry can be variable. It can also be completely absent which would indicate empty events payload and is a error condition and would definitely like to catch this case as well.

Since in this case the output of entry spans across multiple lines I am not getting far using plain vanilla grep. So I am seeking expert advice.

P.S:

  1. Let me be more explicit about my requirement. I would like to capture the whole block of text shown above verbatim and optionally count the number of instances of such blocks encountered. The option to count the number of instances is good to have but not a mandatory requirement.
  2. If the solution to the problem is using awk I would like to save the awk file and re-use it. So please mention the steps to execute the script also. I know regex and grep but I am not familiar with sed and/or awk.

Best Answer

This would do it i hope. Events go to events file. And messages go to stdout.

Save this file to myprogram.awk (for example):

#!/usr/bin/awk -f

BEGIN {
   s=0;  ### state. Active when parsing inside an event
   nevent=0;  ### Current event number
   printf "" > "events"
}

# Start of event
/^ *Data control raising event/ {
   s=1;
   dentries=0;
   print "*** Event number: " nevent >> "events"
   nevent++
}

# Standard event line
s==1 {
   print >> "events"
}

# DataChangeEntry line
/^ *==== DataChangeEntry/ {
   dentries ++
}

# End of event
s==1 && /^ *\]\]/ {
   s=0;
   print "" >> "events"
   if(dentries==0){
      print "Warning: Event " nevent " has no Data Entries"
   }
}

END {
   print "Total event count: " nevent
}

You can invoke it in different ways:

  • myprogram.awk inputfile.txt
  • awk -f myprogram.awk inputfile.txt

Sample output:

Warning: Event 3 has no Data Entries
Total event count: 3

You can check all the events together in the file called events in working directory.

Related Question