Bash Scripting – Search for Three Consecutive Words

bash, regular expression, search, shell, shell-script

There are duplicates in my booklist (txt file) like the following –

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
......
......
......

I need to find the duplicate books and remove them manually after checking. I searched, and the solutions I found require the lines to follow a pattern.

For example:

Remove duplicate lines based on a partial line comparison

Find partial duplicate lines in a file and count how many time each line was duplicated?

But in my case finding a pattern in lines is difficult. However, I found a pattern in the sequence of words.

I want to mark lines as duplicates only if they share three consecutive words (compared case-insensitively).

For example, in –

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

"Ideal Team Player" is the sequence of three consecutive words that I am looking for.

I would like the output to be something like the following –

3 Ideal Team Player
2 Joy on Demand
2 Search Inside Yourself
......
......
......

How can I do that?

Best Answer

The following awk program stores a count for how many times each set of three consecutive words occurs (after removing punctuation characters), and prints the counts and the set of words at the end if the count is larger than 1:

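{
        # Remove punctuation so that "Player:" and "Player" compare equal.
        gsub("[[:punct:]]", "")

        # Count every run of three consecutive words on the line.
        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++
}
END {
        for (key in w) {
                count = w[key]
                if (count > 1) {
                        # The words of a key are joined with SUBSEP; show spaces instead.
                        gsub(SUBSEP," ",key)
                        print count, key
                }
        }
}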

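Given the text in your question, this produces the following (the order of the lines may differ, since awk does not guarantee any particular order when looping over an array):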

2 Search Inside Yourself
2 Cultivate The Three
2 The Three Essential
2 Joy on Demand
2 Recognize and Cultivate
2 Three Essential Virtues
2 and Cultivate The
2 The Ideal Team
3 Ideal Team Player

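As you can see, this may not be so useful: two duplicate titles contribute every overlapping triplet they share, so the list is cluttered with fragments such as "and Cultivate The".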

Instead, we can collect the same count information and then do a second pass over the file, printing each line that contains a word triplet with a count larger than one:

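# First pass (first file argument): count the word triplets.
NR == FNR {
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++

        next
}

# Second pass (second file argument): print each line, unmodified,
# that contains a triplet seen more than once.
{
        orig = $0
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                if (w[$(i-2),$(i-1),$i] > 1) {
                        print orig
                        next
                }
}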

Testing on your file:

$ cat file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
$ awk -f script.awk file file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
Joy on Demand: The Art of Discovering the Happiness Within
Joy on Demand
Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
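
The question asks for a case-insensitive comparison, whereas the programs above compare words exactly as written, so "Joy On Demand" and "Joy on Demand" would not be treated as the same triplet. Below is a minimal sketch of one way to handle that, assuming it is acceptable to lowercase a copy of each line before building the triplets. The shell function name dup_lines_nocase is made up for illustration and is not part of the original answer.

```shell
# Hypothetical wrapper around the two-pass awk program above.
# Usage: dup_lines_nocase FILE
dup_lines_nocase() {
    awk '
    NR == FNR {
        # First pass: count lowercased, punctuation-free triplets.
        $0 = tolower($0)
        gsub(/[[:punct:]]/, "")
        for (i = 3; i <= NF; ++i)
            w[$(i-2),$(i-1),$i]++
        next
    }
    {
        # Second pass: keep the original line for printing,
        # but look the triplets up in their lowercased form.
        orig = $0
        $0 = tolower($0)
        gsub(/[[:punct:]]/, "")
        for (i = 3; i <= NF; ++i)
            if (w[$(i-2),$(i-1),$i] > 1) {
                print orig
                next
            }
    }' "$1" "$1"
}
```

Calling dup_lines_nocase file reads the file twice, just like awk -f script.awk file file, and prints the matching lines with their original capitalization.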

Caveat: This awk program needs enough memory to store the text of your file about three times over, and may find duplicates in common phrases even when the entries are actually not truly duplicated (e.g. "how to cook" may be part of the titles of several books).
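
One possible way to reduce false positives from common phrases is to require a longer run of matching words. The sketch below generalizes the two-pass program to N consecutive words. The function name dup_lines_n, the helper ngram, and the default of 3 are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: print the lines that share a run of N consecutive words.
# Usage: dup_lines_n FILE [N]      (N defaults to 3)
dup_lines_n() {
    awk -v n="${2:-3}" '
    # Build the array key for the n words that end at field i.
    function ngram(i,    j, k) {
        k = $(i - n + 1)
        for (j = i - n + 2; j <= i; ++j)
            k = k SUBSEP $j
        return k
    }
    BEGIN {
        n += 0                  # force a numeric value
        if (n < 1) n = 3        # fall back to triplets on bad input
    }
    NR == FNR {
        # First pass: count every run of n words.
        gsub(/[[:punct:]]/, "")
        for (i = n; i <= NF; ++i)
            w[ngram(i)]++
        next
    }
    {
        # Second pass: print lines containing a run seen more than once.
        orig = $0
        gsub(/[[:punct:]]/, "")
        for (i = n; i <= NF; ++i)
            if (w[ngram(i)] > 1) {
                print orig
                next
            }
    }' "$1" "$1"
}
```

With the sample file from the question, dup_lines_n file 4 should print only the three "Ideal Team Player" lines, because "Joy on Demand" and "Search Inside Yourself" are just three words long. A larger N trades fewer false positives for missed short titles.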
