Display lines with not more than 4 words in the second column

awk text-processing

I have a symbol table of the form

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>
M07UP49A0861C86105.wav  <s> waa khaada aadi kaa upayoga laabhadaayaka paaya gayaa hai  </s>
M07UP49A0861C86106.wav  <s> aadi kisaan apnee stara para bhii taiyaara kara sakatee hai </s>
M07UP49A0861C86107.wav  <s> kii gobara kaa upayoga kandxee banaakara iindhana kee ruupa mee kiyaa jaata hai <bang> </s>
M07UP49A0861C86108.wav  <s> geehuun kii phasala kii katxaayii kee baada <horn> kheeto ko aaga lagaakara saapha kiyaa jaata hai <babble> </s>
M07UP49A0861C86109.wav  <s> badxqii maatraa mee jiiwaanqu jalakara nashtxa ho jaataa hai <babble> </s>

As is evident, this file contains two columns. The first column is the name of the audio file (with the .wav extension) and the second column is the transcript of that audio file.

The second column is supposed to consist of not more than 4 words (excluding tags; tags are the words written in <>).

For example, consider the second line. It has only one word, i.e. jau (note that

<s> 
</s> 
<babble> 
<horn> 

are not included in the word count of this line because they are tags).

In essence, in any line, a word in the second column is a string not surrounded by <>.

Now my job is to find only those lines that have not more than 4 words in the second column.

I used the following command,

gawk 'NF>4' file > output

but did not get the expected results.

For your convenience, here is the expected output

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>

These two lines are expected because the second column of the first line contains only two words, i.e. haraa and keelaa, and the second column of the second line contains only one word, i.e. jau.

All the other lines have more than 4 words in their second column.

Best Answer

This could be done with a small Python script:

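#!/usr/bin/env python3
import sys

# Print every non-empty line whose transcript holds at most 4 words;
# tokens of the form <...> are tags and are not counted as words.
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        # Everything after the last "<s>" is the transcript.
        tokens = line.split("<s>")[-1].split()
        words = [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]
        if line and len(words) <= 4:
            print(line)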

Assuming you have python3 installed:

  • Copy it into an empty file, save it as get_colls.py
  • Run it with the file as argument:

    python3 /path/to/get_colls.py <file>
    

Output on the example:

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>

Explanation

The script:

  • splits each line at the delimiter <s> and keeps the part after it (the transcript)
  • in that part, counts the whitespace-separated strings that are not tags, i.e. that do not both start with < and end with >
  • prints the lines for which that count is <= 4
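
To see these steps on actual data, here is a minimal sketch that applies them to three lines from the question. The names count_words and samples are illustrative only and are not part of the script above. For comparison it also prints the plain whitespace-separated field count, which is what awk's NF reports with the default field separator.

```python
def count_words(line):
    """Count the non-tag tokens after the last "<s>" marker in a line."""
    transcript = line.strip().split("<s>")[-1]
    return sum(
        1 for token in transcript.split()
        if not (token.startswith("<") and token.endswith(">"))
    )


# Three lines copied from the symbol table in the question.
samples = [
    "M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>",
    "M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>",
    "M07UP49A0861C86109.wav  <s> badxqii maatraa mee jiiwaanqu jalakara nashtxa ho jaataa hai <babble> </s>",
]

for line in samples:
    fields = line.split()  # whitespace-separated fields, like awk's default splitting
    # Columns printed: word count, total field count, file name.
    print(count_words(line), len(fields), fields[0])
```

The first two lines contain 2 and 1 words but 6 fields each, because the file name and the tags are fields too. That is why a test on NF alone cannot separate them from the longer lines.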