I have a symbol table of the form:
M07UP49A0870I422.wav <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav <s> <horn> jau <babble> </s>
M07UP49A0861C86105.wav <s> waa khaada aadi kaa upayoga laabhadaayaka paaya gayaa hai </s>
M07UP49A0861C86106.wav <s> aadi kisaan apnee stara para bhii taiyaara kara sakatee hai </s>
M07UP49A0861C86107.wav <s> kii gobara kaa upayoga kandxee banaakara iindhana kee ruupa mee kiyaa jaata hai <bang> </s>
M07UP49A0861C86108.wav <s> geehuun kii phasala kii katxaayii kee baada <horn> kheeto ko aaga lagaakara saapha kiyaa jaata hai <babble> </s>
M07UP49A0861C86109.wav <s> badxqii maatraa mee jiiwaanqu jalakara nashtxa ho jaataa hai <babble> </s>
As is evident, this file contains two columns: the first column is the name of the audio file (with a .wav extension) and the second column is the transcript of that audio file.
The second column is supposed to consist of not more than 4 words (excluding tags; tags are the words written in <>).
For example, consider the second line. This line has only one word, i.e. jau (note that <s>, </s>, <babble> and <horn> are not included in the word count of this line because they are tags).
In essence, in any line, a word in the second column is a string not surrounded by <>.
Now my job is to find only those lines that have no more than 4 words in the second column.
I used the following command:
gawk 'NF>4' file > output
but it did not give the expected results.
For your convenience, here is the expected output:
M07UP49A0870I422.wav <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav <s> <horn> jau <babble> </s>
I expect this output because the second column of the first line contains only two words, i.e. haraa and keelaa, and the second column of the second line contains only one word, i.e. jau.
All the other lines have more than 4 words in their second column.
Best Answer
This could be done with a small Python script.
Assuming you have python3 installed, save the script as get_colls.py.
Run it with the file as argument:
Output on the example:
Explanation
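The script counts, for each line, the strings in the second column that are not tags such as <s> (a tag being a string starting with < and ending with >), and prints the line if that count is <= 4.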
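The approach described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the original get_colls.py: the helper names is_tag, count_words and filter_lines are made up for this example, and it assumes that columns are separated by whitespace and that a tag is any token that starts with < and ends with >.

```python
#!/usr/bin/env python3
"""Print lines whose transcript has at most MAX_WORDS non-tag words."""
import sys

MAX_WORDS = 4  # limit stated in the question


def is_tag(token):
    # Assumption: a tag is a token wrapped in angle brackets,
    # e.g. <s>, </s>, <horn>, <babble>.
    return token.startswith("<") and token.endswith(">")


def count_words(line):
    # The first field is the audio file name; the rest is the transcript.
    tokens = line.split()[1:]
    return sum(1 for token in tokens if not is_tag(token))


def filter_lines(lines, max_words=MAX_WORDS):
    # Keep non-empty lines whose transcript has at most max_words words.
    return [l for l in lines if l.strip() and count_words(l) <= max_words]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 get_colls.py file > output
    with open(sys.argv[1]) as f:
        for line in filter_lines(f):
            print(line.rstrip("\n"))
```

Note that this also shows why gawk 'NF>4' does not help here: NF counts every whitespace-separated field, including the file name and the tags, and the > comparison keeps the longer lines rather than the shorter ones.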