Bash – Command line method to find repeat-word typos, with line numbers

Tags: aspell, awk, bash, command line, text processing

Updated: Clarify line number requirement, some verbosity reductions

From the command line, is there a way to:

  • check a file of English text
  • for repeat-word typos,
  • and report the line numbers where they are found,

in order to help correct them?

Example 1

Currently, to help finish an article or other piece of English writing, aspell -c text.txt is helpful for catching spelling errors. It is not helpful, however, when the error is an unintentional consecutive repetition of a word.

highlander_typo.txt:

There can be only one one.

Running aspell:

$ aspell -c highlander_typo.txt

This is probably because aspell is a spell-checker, not a grammar-checker, so repeat-word typos are beyond its intended scope. As a result, the file passes aspell's check, because nothing is "wrong" with the spelling of any individual word.

The correct sentence is "There can be only one."; the second "one" is an unintended repeat-word typo.

Example 2

A different situation arises in, for example, kylie_minogue.txt:

La la la

Here the repetition is not a typo, as it is part of an artist's song lyrics.

So the solution should not presume anything and "fix" it by itself; otherwise it could overwrite intentionally repeated words.

Example 3: Multi-line

jefferson_typo.txt:

He has has refused his Assent to Laws, the most wholesome and necessary
for the public good.
He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
Assent should be be obtained; and when so suspended, he has utterly
neglected to attend to them.

Modified from The Declaration of Independence

In the above six lines,

  • 1: "He has has refused" should be "He has refused"; the second "has" is a repeat-word typo
  • 5: "should be be obtained" should be "should be obtained"; the second "be" is a repeat-word typo

However, did you notice a third repeat-word typo?

  • 3: ... immediate and
  • 4: and pressing ...

This is also a repeat-word typo: although the two words are on separate lines, they are part of the same English sentence. The word at the end of one line has accidentally been added again at the start of the next. This is rather tricky to detect by eye, because the repetition sits on opposite sides of a passage of text.

Intended output

  • an interactive program with a process similar to aspell -c yet able to detect repeat-words, or,

  • a script or combination of commands able to extract line numbers and the suspected repeat words. This info makes it easier to use an editor such as vim to jump to the repeat words and make fixes where appropriate.

Using the multi-line jefferson_typo.txt above, the desired output would be something like:

1: has has
3: and
4: and
5: be be

or:

1: He [has has] refused his Assent to Laws, the most wholesome and necessary
3: He has forbidden his Governors to pass Laws of immediate [and]
4: [and] pressing importance, unless suspended in their operation till his
5: Assent should [be be] obtained; and when so suspended, he has utterly

I am actually not entirely sure how to display the difficult case of an inter-line or cross-line repeat-word, such as the "and" repetition above, so don't worry if your solution doesn't resemble this exactly.

But I hope that, like the above, it shows:

  • the relevant line number in the original input
  • some way to draw attention to what is repeated, which is especially helpful if the line of text is quite long
  • if the full line is displayed to give context (credit: @Wildcard), then there needs to be a way to render the repeated word or words distinctively. The example shown here marks the repetition by enclosing it within the ASCII characters [ ]. Alternatively, perhaps mimic grep --color=always to colorize the line's matches for display in a color terminal

Other considerations

  • the text should stay as plain text files
  • no GUI solutions please, just textual ones: ssh -X X11 forwarding is not reliably available, and I need to edit over ssh

Unsuccessful attempts

To find duplicates, uniq came to mind, so the plan was to first get repeat-word recognition working on a single line.

To use uniq, we first need to convert the words on a line into one word per line.

$ tr ' ' '\n' < highlander_typo.txt
There
can
be
only
one
one.

Unfortunately:

$ tr ' ' '\n' < highlander_typo.txt | uniq -D

Nothing.

This is because the -D option, which normally reveals duplicates, requires the input lines to be exact duplicates. Unfortunately, the period . at the end of the repeated word one prevents this: one. just looks like a different line from one. I am not sure how to work around arbitrary punctuation marks such as this period, and somehow add them back after tr processing.
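One possible way around the punctuation problem, sketched here rather than offered as a full solution, is to have tr split on every run of non-letters instead of only on spaces, so that one. and one become identical lines. The function name repeated_words is made up for illustration, and the -i flag of uniq (ignore case) is assumed to be available, as it is in the GNU and BSD versions. Line numbers and the original punctuation are lost along the way, and words with apostrophes are split in two.

```shell
# Hypothetical helper: list words that occur twice in a row, ignoring
# punctuation and case. Line numbers are lost because the text is re-split.
repeated_words() {
  # -c complements the set and -s squeezes runs, so every run of
  # non-letters becomes a single newline: "one." and "one" now compare equal.
  # uniq -d prints one copy of each adjacent duplicate; -i ignores case.
  tr -cs '[:alpha:]' '\n' < "$1" | uniq -di
}

# Usage: repeated_words highlander_typo.txt
```

Because the whole file becomes one stream of words, this also catches a repeat that straddles a line break, but it cannot say where in the file the repeat is.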

That attempt with uniq -D was unsuccessful. Had it worked, the next step would have been to include the line number, since the input file could have hundreds of lines and it would help to indicate which line of the input file the repeat-word was detected on.

This single-line processing would perhaps sit inside a parent loop that goes through the file line by line, and so processes every line. Unfortunately, getting past even single-line repeat-word recognition has been problematic.
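The line-by-line loop described above could look roughly like the following awk sketch, which remembers the previous word and the line it was on. The function name find_repeats and the output format are only illustrative. It assumes ASCII words, strips punctuation before comparing, and treats a blank line as a paragraph break. That means it will also flag a word repeated across a sentence boundary (one. One), so each hit still needs a human decision.

```shell
# Hypothetical helper: print "LINE: word word" for a repeat within a line,
# and "LINE: word" for both lines when a repeat straddles a line break.
find_repeats() {
  awk '
    NF == 0 { prev = ""; next }            # a blank line ends the paragraph
    {
      for (i = 1; i <= NF; i++) {
        word = tolower($i)                 # compare case-insensitively
        gsub(/[^a-z0-9]/, "", word)        # drop punctuation (ASCII-only assumption)
        if (word != "" && word == prev) {
          if (prev_line == NR) {
            printf "%d: %s %s\n", NR, word, word
          } else {
            printf "%d: %s\n%d: %s\n", prev_line, word, NR, word
          }
        }
        prev = word                        # remember the word and its line
        prev_line = NR
      }
    }
  ' "$1"
}

# Usage: find_repeats jefferson_typo.txt
```

Nothing is modified in the input file; the output is only a list of line numbers to jump to in an editor such as vim.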

Best Answer

Edited: added install and demo

You need to take care of at least some edge cases, like

  • repeated words at the end (and beginning) of a line.
  • the search should be case-insensitive, because of frequent errors like The the apple.
  • you probably want to restrict the search to word-constituent characters only, so as not to match something like ( ( a + b) + c ) (repeated opening parentheses).
  • only full words should match, to eliminate cases like the thesis.
  • when it comes to human language, Unicode characters inside words should be properly interpreted.

All in all, I recommend a pcregrep solution:

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

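Obviously, color and line numbers (the n option) are optional, but usually nice to have. The M option lets a match span more than one line, and the i option makes the search case-insensitive.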

Install

On Debian-based distributions you can install via:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to see:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is just a text capture, but on a terminal that supports color, the matches are colorized:

  • has has
  • and
  • and
  • be be