Display lines with not more than 4 words in the second column

awk text-processing

I have a symbol table of the form

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>
M07UP49A0861C86105.wav  <s> waa khaada aadi kaa upayoga laabhadaayaka paaya gayaa hai  </s>
M07UP49A0861C86106.wav  <s> aadi kisaan apnee stara para bhii taiyaara kara sakatee hai </s>
M07UP49A0861C86107.wav  <s> kii gobara kaa upayoga kandxee banaakara iindhana kee ruupa mee kiyaa jaata hai <bang> </s>
M07UP49A0861C86108.wav  <s> geehuun kii phasala kii katxaayii kee baada <horn> kheeto ko aaga lagaakara saapha kiyaa jaata hai <babble> </s>
M07UP49A0861C86109.wav  <s> badxqii maatraa mee jiiwaanqu jalakara nashtxa ho jaataa hai <babble> </s>

As is evident, this file contains two columns. The first column is the name of the audio file (with a .wav extension) and the second column is the transcript of that audio file.

The second column is supposed to consist of not more than 4 words (excluding tags; tags are the words written in <>).

For example, consider the second line. This line has only one word i.e. jau (note that

<s> 
</s> 
<babble> 
<horn>

are not included in the word count of this line because they are tags).

In essence, in any line, a word in the second column is a string not surrounded by <>.

Now my job is to find only those lines that have not more than 4 words in the second column.

I tried the following command,

gawk 'NF>4' file > output

but it did not give the expected result.

For your convenience, here is the expected output

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>

I expect this output because the second column of the first line contains only two words, i.e. haraa and keelaa, and the second column of the second line contains only one word, i.e. jau.

All the other lines have more than 4 words in their second column.

Best Answer

This could be done with a small Python script:

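#!/usr/bin/env python3
import sys

# Print the lines whose transcript has at most 4 non-tag words.
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        # Tokens after <s>; a tag both starts with "<" and ends with ">".
        tokens = line.split("<s>")[-1].split()
        words = [t for t in tokens
                 if not (t.startswith("<") and t.endswith(">"))]
        if len(words) <= 4:
            print(line)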

Assuming you have python3 installed:

Copy it into an empty file, save it as get_colls.py
Run it with the file as argument:
```
python3 /path/to/get_colls.py <file>
```

Output on the example:

M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>
M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>

Explanation

The script:

splits each line on the delimiter <s> and keeps the part after it
in that part, counts the whitespace-separated strings that are not tags, i.e. that do not both start with < and end with >
prints the lines whose count is <= 4
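
To make the counting step concrete, here is a minimal sketch that applies the same rule to three lines from the question and prints the word count next to each file name. The helper name count_words is only for illustration and is not part of the script above.

```python
def count_words(line):
    """Count the tokens after <s> that are not <...> tags."""
    transcript = line.split("<s>")[-1]
    return sum(
        1
        for tok in transcript.split()
        if not (tok.startswith("<") and tok.endswith(">"))
    )


# Three lines taken from the symbol table in the question.
samples = [
    "M07UP49A0870I422.wav    <s> haraa keelaa <bn> </s>",
    "M07UP49A0870I423.wav    <s> <horn> jau <babble>  </s>",
    "M07UP49A0861C86109.wav  <s> badxqii maatraa mee jiiwaanqu jalakara nashtxa ho jaataa hai <babble> </s>",
]

for line in samples:
    # Prints the count followed by the file name: 2, 1 and 9 respectively.
    print(count_words(line), line.split()[0])
```

Only the first two lines have a count of 4 or less, which matches the expected output.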

Related Solutions

Bash – How to extract lines by words in specific position, not column

You can use

grep -E '^.{21}A' file

if you want to include cases like A1023, and

grep -E '^.{21}A\>' file

if you want only lines where A appears as an isolated character.

NOTE: In the second example, \> matches the empty string at the end of a word, so the A at that position must not be followed by another word character.

Excerpt from the grep man page:

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
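
As a rough illustration of the difference between the two patterns, the sketch below reproduces them with Python's re module. Two assumptions: re has no \> anchor, so \b (a word boundary on either side) stands in for it here, and the sample lines are made up for the example.

```python
import re

# Hypothetical sample lines: 21 filler characters, then the character of interest.
lines = [
    "x" * 21 + "A rest",      # A followed by a space
    "x" * 21 + "A1023 rest",  # A followed by more word characters
    "x" * 21 + "B rest",      # no A at position 22
]

any_a = re.compile(r"^.{21}A")         # like grep -E '^.{21}A'
isolated_a = re.compile(r"^.{21}A\b")  # approximates grep -E '^.{21}A\>'

for line in lines:
    print(bool(any_a.search(line)), bool(isolated_a.search(line)), line)
```

The first pattern accepts both A and A1023, while the second accepts only the line where A is followed by a non-word character.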

Print only lines where the first column is unique

awk normally reads each line of the input and invokes the script on it. The cases where you would use getline are few and far between. When your script is run with six lines of input, this is an overview of what happens:

Read line 1 normally

Set variables
Call getline, which reads line 2
Compare variables

Read line 3 normally

Set variables
Call getline, which reads line 4
Compare variables

Read line 5 normally

Set variables
Call getline, which reads line 6
Compare variables

Obviously this isn’t going to work.
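
The overview above can be simulated with a short Python sketch. It is not awk, only an analogy: the for loop plays the role of awk's normal read, and next() plays the role of getline. The function name and the line labels are made up for the example.

```python
def paired_comparisons(lines):
    """Read one line per iteration, then pull one extra line inside the loop body."""
    it = iter(lines)
    compared = []
    for current in it:        # the normal read
        nxt = next(it, None)  # the extra read, analogous to getline
        if nxt is not None:
            compared.append((current, nxt))
    return compared


pairs = paired_comparisons(["line1", "line2", "line3", "line4", "line5", "line6"])
print(pairs)
```

Lines 2 and 3 are never compared with each other, and neither are lines 4 and 5, so a duplicate that straddles one of those boundaries goes unnoticed.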
Secondly, you made a common mistake in your awk code. In awk, fields from the input are referenced as $number and variables are referenced as variable_name. This is different from shell scripts, where command line arguments are referenced as $number and variables are referenced as $variable_name. Your test
```
if ($1 != $path)
```
should be
```
if ($1 != path)
```
Your overall approach is flawed. You can’t identify strings that occur only once in the file by looking at two lines at a time. I believe that you can do it by looking at three lines at a time (i.e., by keeping the two previous lines in variables), but things like that get complicated and messy. It’s probably simpler to count occurrences. Here’s a minimal modification on your script to do that.
```
awk '{
  if ($1 != path) {
    if (count == 1) {
      print prev
    }
    count=1
  }
  else count++
  prev=$0; path=$1
}
END {
    if (count == 1) {
      print prev
    }
}'
```
I deleted type, since you never used it.

Disclosure: This is essentially the same as the last part of glenn’s answer.
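
For comparison, here is a minimal Python sketch of the same count-the-occurrences idea. It is a different technique from the awk script above: it makes two passes over the lines with collections.Counter, so it does not rely on equal first columns being adjacent. The function name and the sample data are hypothetical.

```python
from collections import Counter


def unique_first_column(lines):
    """Return the lines whose first whitespace-separated field occurs exactly once."""
    rows = [line for line in lines if line.strip()]  # skip blank lines
    counts = Counter(row.split()[0] for row in rows)
    return [row for row in rows if counts[row.split()[0]] == 1]


# Hypothetical input: a path in the first column, something else in the second.
sample = [
    "/a type1",
    "/a type2",
    "/b type1",
    "/c type1",
    "/c type3",
]

print(unique_first_column(sample))
```

Only /b appears once in the first column, so only that line is returned.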
