Shell – Extract paragraph separated with *** using AWK

awk, grep, sed, shell, text-processing

I have a file like the one below:

blablabla
blablabla
***
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3

blablabla
blablabla

I want to extract the paragraph with thingsIwantToRead. When I have had to deal with this kind of problem before, I used AWK like this:

awk 'BEGIN{ FS="Separator above the paragraph"; RS="" } {print $2}' $file.txt | awk 'BEGIN{ FS="separator below the paragraph"; RS="" } {print $1}'

And it worked.

In this case, I tried setting FS to "***", "\*{3}", "\*\*" (which does not work because AWK treats it like a normal asterisk), "\\*\\*", and every other regex I could think of, but none of them works: nothing is printed.

Do you know why?

If not, do you know another way to deal with my problem?

Below is an extract of the file I want to parse:

13.2000000000     , 3*0.00000000000       ,  11.6500000000     , 3*0.00000000000       ,  17.8800000000

Blablabla

  SATELLITE EPHEMERIS
     ===================
Output frame: Mean of J2000

       Epoch                  A            E            I           RA           AofP          TA      Flight Ang
*****************************************************************************************************************
2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311
 <np>
 ----------------
 Predicted Orbit:
 ----------------

 Blablabla

And I want to extract:

2012/10/01 00:00:00.000     6998.239     0.001233     97.95558     77.41733     89.98551    290.75808    359.93398
2012/10/01 00:05:00.000     6993.163     0.001168     97.95869     77.41920    124.72698    274.57362    359.93327
2012/10/01 00:10:00.000     6987.347     0.001004     97.96219     77.42327    170.94020    246.92395    359.94706
2012/10/01 00:15:00.000     6983.173     0.000893     97.96468     77.42930    224.76158    211.67042    359.97311

And the command I tried to use to get the numbers after the line of *'s:

`awk 'BEGIN{ FS="\\*{2,}"; RS="" } {print $2}' file | awk 'BEGIN{ FS="<np>"; RS="" } {print $1}'`

Best Answer

Tell awk to print between the two delimiters. Specifically:

awk '/\*{4,}/,/<np>/' file

That will also print the lines containing the delimiters, so you can remove them with:

awk '/\*{4,}/,/<np>/' file | tail -n +2 | head -n -1
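As a variation for systems where head does not accept a negative line count (that form is a GNU extension), sed can drop the first and last lines of the range instead. This is a minimal sketch, not part of the original answer: sample_input and strip_delims are hypothetical helper names, the input is made-up test data, and the four escaped asterisks stand in for the \*{4,} interval so the example does not depend on interval-expression support.

```shell
# Made-up input: one block delimited by a line of asterisks and a <np> line.
sample_input() {
    printf '%s\n' 'before' '****' 'row 1' 'row 2' ' <np>' 'after'
}

# The awk range pattern prints from the asterisk line through the <np> line;
# sed then deletes the first (1d) and the last ($d) line of that output.
strip_delims() {
    awk '/\*\*\*\*/,/<np>/' | sed '1d;$d'
}

sample_input | strip_delims
# prints:
# row 1
# row 2
```

This assumes the input contains a single delimited block; with several blocks, only the very first and the very last delimiter lines would be removed.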

Alternatively, you can set a variable to true if a line matches the 1st delimiter and to false when it matches the second and only print when it is true:

awk '/\*{4,}/{a=1; next}/<np>/{a=0}(a==1){print}' file

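The command above sets a to 1 when the current line contains 4 or more consecutive * and then skips to the next line, so the line of asterisks is never printed. A line matching <np> sets a back to 0 before the print rule is reached, so that line is not printed either.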
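The same flag technique can be tried on a shortened copy of the extract from the question. The sketch below is an illustration under a few assumptions: sample_input, extract_block and in_block are made-up names, the delimiter line is assumed to consist only of asterisks, and the bracket expression [*] is used in place of \*{4,} because some awk implementations (older mawk releases, for example) do not support the {n,} interval syntax.

```shell
# Shortened, hypothetical copy of the ephemeris extract from the question.
sample_input() {
    cat <<'EOF'
       Epoch                  A            E
********************************************
2012/10/01 00:00:00.000     6998.239     0.001233
2012/10/01 00:05:00.000     6993.163     0.001168
 <np>
 Predicted Orbit:
EOF
}

# Reads stdin and prints only the lines between the two delimiters.
extract_block() {
    awk '
        /^[*]+$/ { in_block = 1; next }  # line made only of asterisks: start and skip it
        /<np>/   { in_block = 0 }        # closing delimiter: clear the flag before printing
        in_block                         # bare pattern: print the line while the flag is set
    '
}

sample_input | extract_block
```

On a real file, extract_block < file would take the place of the pipe from sample_input.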


The rest of this answer addressed the original, misunderstood version of the question. I'm leaving it here since it can be useful in a slightly different situation.

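First of all, you don't want FS (field separator), you want RS (record separator). Then, to pass a literal *, you need to escape it twice: once to escape the * and once to escape the backslash (otherwise, awk will try to interpret it as an escape sequence, in the same way as \r or \t). A multi-character RS treated as a regex is an extension (gawk and mawk support it); POSIX leaves the behavior unspecified. With that in place, you print the 2nd record: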

$ awk -vRS='\\*\\*\\*' 'NR==2' file

thingsIwantToRead1   
thingsIwantToRead2   
thingsIwantToRead3  

To avoid the blank lines around the output, use:

$ awk -vRS='\n\\*\\*\\*\n' 'NR==2' file
thingsIwantToRead1   
thingsIwantToRead2   
thingsIwantToRead3  

Note that this assumes a *** after each paragraph, not only after the first one as you show.
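For the layout in the original question, where *** appears only above the paragraph and a blank line ends it, one possible approach is to combine a flag with an early exit rather than changing RS. This is a sketch under those assumptions; sample_input, paragraph_after_stars and found are hypothetical names.

```shell
# The sample file from the original question.
sample_input() {
    cat <<'EOF'
blablabla
blablabla
***
thingsIwantToRead1
thingsIwantToRead2
thingsIwantToRead3

blablabla
blablabla
EOF
}

# Prints the lines after the *** line, stopping at the first blank line.
paragraph_after_stars() {
    awk '
        /^[*][*][*]$/ { found = 1; next }  # delimiter line: start collecting, skip it
        found && /^$/ { exit }             # first blank line after the delimiter: stop
        found                              # print the lines of the paragraph
    '
}

sample_input | paragraph_after_stars
```

Because [*] is a bracket expression, no backslash escaping is needed, and the program does not rely on a regex-valued RS.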
