Updated: Clarify line number requirement, some verbosity reductions
From the command line, is there a way to:
- check a file of English text
- to find repeat-word typos,
- along with line numbers where they are found,
in order to help correct them?
Example 1
Currently, when finishing an article or other piece of English writing, aspell -c text.txt is helpful for catching spelling errors. But it does not help when the error is an unintentional consecutive repetition of a word.
highlander_typo.txt:
There can be only one one.
Running aspell:
$ aspell -c highlander_typo.txt
This is probably because aspell is a spell-checker, not a grammar-checker, so repeat-word typos are beyond its intended scope. As a result, this file passes aspell's check because nothing is "wrong" in terms of individual word spelling.
The correct sentence is "There can be only one."; the second "one" is an unintended repeat-word typo.
Example 2
A different situation is, for example, kylie_minogue.txt:
La la la
Here the repetition is not a typo, as these are part of an artist's song lyrics.
So the solution should not presume to "fix" anything by itself; otherwise it could overwrite intentionally repeated words.
Example 3: Multi-line
jefferson_typo.txt:
He has has refused his Assent to Laws, the most wholesome and necessary
for the public good.
He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
Assent should be be obtained; and when so suspended, he has utterly
neglected to attend to them.
Modified from The Declaration of Independence
In the above six lines,
- 1: "He has has refused" should be "He has refused"; the second "has" is a repeat-word typo
- 5: "should be be obtained" should be "should be obtained"; the second "be" is a repeat-word typo
However, did you notice a third repeat-word typo?
- 3: ... immediate and
- 4: and pressing ...
This is also a repeat-word typo: although the two words are on separate lines, they are part of the same English sentence, and the word at the end of one line has accidentally been repeated at the start of the next. It is rather tricky to detect by eye because the two copies sit on opposite sides of a passage of text.
Intended output
- an interactive program with a process similar to aspell -c yet able to detect repeat-words, or,
- a script or combination of commands able to extract line numbers and the suspected repeat words. This info makes it easier to use an editor such as vim to jump to the repeat words and make fixes where appropriate.
Using the multi-line jefferson_typo.txt above, the desired output would be something like:
1: has has
3: and
4: and
5: be be
or:
1: He [has has] refused his Assent to Laws, the most wholesome and necessary
3: He has forbidden his Governors to pass Laws of immediate [and]
4: [and] pressing importance, unless suspended in their operation till his
5: Assent should [be be] obtained; and when so suspended, he has utterly
I am not entirely sure how to display the difficult case of a cross-line repeat word, such as the "and" repetition above, so don't worry if your solution doesn't resemble this exactly.
But I hope that, like the above, it shows:
- the line number of the relevant line in the original input
- some way to draw attention to what was repeated, which is especially helpful if the line of text is quite long
- if the full line is displayed to give context (credit: @Wildcard), there needs to be a way to render the repeated word or words distinctively. The example shown here marks the repetition by enclosing it within the ASCII characters [ ]. Alternatively, perhaps mimic grep --color=always to colorize the line's matches for display in a color terminal
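To make the bracketed format concrete, here is a minimal awk sketch that produces it. It is only an illustration under some assumptions: the text is ASCII English, "words" are whitespace-separated tokens, and two tokens count as the same word if they match after lowercasing and dropping everything except letters and digits. The function name mark_repeats is made up for this example. It only reports and never modifies the input, and it rejoins tokens with single spaces, so original spacing is not preserved.

```shell
# mark_repeats: hypothetical helper name, not an existing tool.
# Reads the named files (or stdin) and prints every line that contains
# part of a repeated word, with the repeated words enclosed in [ ].
mark_repeats() {
  awk '
    # Normalize a token for comparison: lowercase, letters and digits only.
    function key(w) { w = tolower(w); gsub(/[^a-z0-9]/, "", w); return w }

    {                                   # remember every token of every line
      n[NR] = NF
      for (i = 1; i <= NF; i++) tok[NR, i] = $i
    }

    END {
      prev = ""
      for (ln = 1; ln <= NR; ln++) {    # pass 1: flag both copies of each repeat
        if (n[ln] == 0) { prev = ""; continue }   # blank line ends a paragraph
        for (i = 1; i <= n[ln]; i++) {
          k = key(tok[ln, i])
          if (k != "" && k == prev) {
            flag[pl, pi] = 1; flag[ln, i] = 1
            hit[pl] = 1; hit[ln] = 1
          }
          prev = k; pl = ln; pi = i     # position of the previous token
        }
      }
      for (ln = 1; ln <= NR; ln++) {    # pass 2: print flagged lines with brackets
        if (!hit[ln]) continue
        out = ""
        for (i = 1; i <= n[ln]; i++) {
          first = flag[ln, i] && !(i > 1 && flag[ln, i - 1])      # run starts here
          last  = flag[ln, i] && !(i < n[ln] && flag[ln, i + 1])  # run ends here
          out = out (i > 1 ? " " : "") (first ? "[" : "") tok[ln, i] (last ? "]" : "")
        }
        print ln ": " out
      }
    }
  ' "$@"
}
```

Running mark_repeats jefferson_typo.txt should print the four bracketed lines shown above. Because the previous token is carried over from one line to the next, the cross-line "and" is flagged on both line 3 and line 4.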
Other considerations
- text should stay as plain text files
- no GUI solutions please, just textual: ssh -X X11 forwarding is not reliably available, and I need to edit over ssh
Unsuccessful attempts
To find duplicates, uniq came to mind, so the plan was to first get repeat-word recognition working on a single line.
To use uniq, we first need to convert the words on a line into one word per line.
$ tr ' ' '\n' < highlander_typo.txt
There
can
be
only
one
one.
Unfortunately:
$ tr ' ' '\n' < highlander_typo.txt | uniq -D
Nothing.
This is because the -D option, which normally reveals duplicates, requires adjacent lines to be exactly identical. Unfortunately, the period "." at the end of the repeated word "one" defeats this: "one." just looks like a different line from "one". I am not sure how to work around arbitrary punctuation marks such as this period, and somehow add them back after tr processing.
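For reference, one possible way around the punctuation problem is to delete punctuation before uniq compares the words. This is only a sketch: the function name repeated_words is made up for this example, the punctuation is discarded rather than added back, and the comparison stays case-sensitive, so "La la" would not count as a repeat.

```shell
# repeated_words: hypothetical helper name, not a standard tool.
# Reads text on stdin and prints each word that occurs twice in a row.
repeated_words() {
  tr -s '[:space:]' '\n' |   # one word per line, squeezing runs of whitespace
    tr -d '[:punct:]' |      # drop punctuation so "one." compares equal to "one"
    uniq -d                  # print one copy of each run of identical adjacent lines
}

printf 'There can be only one one.\n' | repeated_words   # prints: one
```

Since newlines are treated like any other whitespace, this also notices a word repeated across a line break, but it cannot say which line the repeat was on.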
The uniq attempt was unsuccessful. Had it worked, the next step would have been to include the line number, since the input file could have hundreds of lines and it helps to know which line of the input the repeat-word was detected on.
This single-line processing would perhaps sit inside a parent loop that handles the file line by line, but unfortunately getting past even single-line repeat-word recognition has been problematic.
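The line-by-line loop with line numbers described above could look something like the following awk sketch. It rests on a few assumptions: the text is ASCII English, words are whitespace-separated tokens, and tokens are compared after lowercasing and stripping anything that is not a letter or digit. The function name find_repeats and the output format are illustrative choices, not an established tool. It only reports suspects and changes nothing in the file.

```shell
# find_repeats: hypothetical helper name, not an existing tool.
# Reads the named files (or stdin) and prints "LINE: word word" for a
# same-line repeat, or one "LINE: word" entry per line for a cross-line repeat.
find_repeats() {
  awk '
    NF == 0 { prev = ""; next }          # blank line: do not pair words across paragraphs
    {
      for (i = 1; i <= NF; i++) {
        word = tolower($i)
        gsub(/[^a-z0-9]/, "", word)      # compare without punctuation: "one." equals "one"
        if (word != "" && word == prev) {
          if (prev_line == NR) {
            print NR ": " word " " word  # both copies are on the same line
          } else {
            print prev_line ": " word    # copy at the end of the earlier line
            print NR ": " word           # copy at the start of this line
          }
        }
        prev = word                      # carried over to the next line as well
        prev_line = NR
      }
    }
  ' "$@"
}

printf 'There can be only one one.\n' | find_repeats   # prints: 1: one one
```

With jefferson_typo.txt as input, find_repeats jefferson_typo.txt should report lines 1, 3, 4 and 5 in the first format shown under "Intended output". The reported line numbers can then be used in vim (for example, typing 3G jumps to line 3). Lyrics such as "La la la" are reported too, which is intended: deciding whether a repeat is a typo is left to the writer.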
Best Answer
Edited: added install and demo
You need to take care of at least some edge cases, like
- The the apple
- ( ( a + b) + c ) (repeated opening parentheses)
- the thesis
All in all I recommend a pcregrep solution:
Obviously color and line number (n option) are optional, but usually nice to have.
Install
On Debian-based distributions you can install via:
Example
Run the command on jefferson_typo.txt to see:
The above is just a text capture, but on a color-supported terminal, matches are colorized: