Regex Pattern – Regex for All 10 Letter Words with Unique Letters

grep, regular expression

I am trying to write a regex that will display all words that are exactly 10 characters long and in which no letter is repeated.

So far, I have got

grep --colour -Eow '(\w{10})'

which covers the very first part of the question. How would I go about checking for the "uniqueness"? I really don't have a clue, apart from knowing that I need to use back-references.

Best Answer

grep -Eow '\w{10}' | grep -v '\(.\).*\1'

excludes words in which the same character appears twice anywhere.

grep -Eow '\w{10}' | grep -v '\(.\)\1'

excludes only the ones in which a character is immediately repeated (the same character twice in a row).
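To make the difference between the two filters concrete, here is a small sketch on made-up sample words. The function names are hypothetical, and -o, -w and \w assume GNU grep.

```shell
# Words of exactly 10 word characters in which no character occurs twice anywhere.
no_repeat_anywhere() {
  grep -Eow '\w{10}' | grep -v '\(.\).*\1'
}

# Words of exactly 10 word characters in which no character is doubled back-to-back.
no_adjacent_repeat() {
  grep -Eow '\w{10}' | grep -v '\(.\)\1'
}

sample='background vocabulary bookkeeper'
printf '%s\n' "$sample" | no_repeat_anywhere   # keeps only "background"
printf '%s\n' "$sample" | no_adjacent_repeat   # keeps "background" and "vocabulary"
```

"vocabulary" has two a's that are not adjacent, so only the first filter rejects it; "bookkeeper" has doubled letters and is rejected by both.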

POSIXly:

tr -cs '[:alnum:]_' '[\n*]' |
   grep -xE '.{10}' |
   grep -v '\(.\).*\1'

tr puts words on their own line by converting any sequence of non-word-characters (complement of alpha-numeric and underscore) to a newline character.
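As an illustration of that splitting step alone, the following sketch runs the same tr command on a made-up phrase (split_words is a hypothetical helper name):

```shell
# Put each "word" on its own line: every run of characters that are neither
# alphanumeric nor underscore is squeezed into a single newline.
split_words() {
  tr -cs '[:alnum:]_' '[\n*]'
}

printf '%s\n' "it's a rendez-vous" | split_words
```

The apostrophe and the hyphen count as non-word characters here, so this prints it, s, a, rendez and vous on separate lines.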

Or with one grep:

tr -cs '[:alnum:]_' '[\n*]' |
   grep -ve '^.\{0,9\}$' -e '.\{11\}' -e '\(.\).*\1'

(exclude lines of fewer than 10 characters, lines of more than 10 characters, and lines with a character appearing at least twice).

With one grep only (GNU grep with PCRE support or pcregrep):

grep -Po '\b(?:(\w)(?!\w*\1)){10}\b'

That is, a word boundary (\b) followed by a sequence of 10 word characters, each of which must not be followed by a sequence of word characters and then itself (checked with the negative look-ahead PCRE operator (?!...)), followed by another word boundary.

We're lucky that it works here, as not many regexp engines work with backreferences inside repeating parts.
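A small sketch of that pattern in use, assuming a GNU grep built with PCRE support (unique10_pcre is a hypothetical name and the sample words are made up):

```shell
# Each of the 10 word characters must not reappear later in the same word;
# the \b on both sides rejects longer words.
unique10_pcre() {
  grep -Po '\b(?:(\w)(?!\w*\1)){10}\b'
}

printf '%s\n' 'background vocabulary backgrounds' | unique10_pcre
```

This prints only background: vocabulary repeats the letter a, and backgrounds has 11 letters, so the trailing \b cannot match after the tenth.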

Note that (with my version of GNU grep at least)

grep -Pow '(?:(\w)(?!\w*\1)){10}'

doesn't work, but

grep -Pow '(?:(\w)(?!\w*\2)){10}'

does (as does echo aa | grep -Pw '(.)\2'), which sounds like a bug.

You may want:

grep -Po '(*UCP)\b(?:(\w)(?!\w*\1)){10}\b'

if you want \w or \b to consider any letter as a word component and not just the ASCII ones in non-ASCII locales.

Another alternative:

grep -Po '\b(?!\w*(\w)\w*\1)\w{10}\b'

That is, a word boundary (one that is not followed by a sequence of word characters in which one repeats) followed by 10 word characters and another word boundary.
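The same kind of check can be run on this variant. This is a sketch that again assumes GNU grep with PCRE support, with a hypothetical function name and made-up input:

```shell
# One negative look-ahead at the start of the word rules out any repeated
# character; \w{10} between two word boundaries then enforces the length.
unique10_lookahead() {
  grep -Po '\b(?!\w*(\w)\w*\1)\w{10}\b'
}

printf '%s\n' 'lumberjack bookkeeper' | unique10_lookahead
```

Only lumberjack is printed, since bookkeeper contains repeated letters and is rejected by the look-ahead.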

Things to possibly have at the back of one's mind:

Comparison is case sensitive, so Babylonish, for instance, would be matched, since all the characters are different even though there are two Bs, one lower case and one upper case (use -i to change that).
For -w, \w and \b, a word character is a letter (ASCII ones only for GNU grep for now, the [:alpha:] character class in your locale if using -P and (*UCP)), a decimal digit or an underscore.
That means that c'est (two words as per the French definition of a word), it's (one word according to some English definitions of a word) and rendez-vous (one word as per the French definition of a word) are not considered one word.
Even with (*UCP), Unicode combining characters are not considered word components, so téléphone ($'t\u00e9le\u0301phone') is considered as 10 characters, one of which is non-alpha. défavorisé ($'d\u00e9favorise\u0301') would be matched even though it has two é, because that's 10 all-different alpha characters followed by a combining acute accent (non-alpha, so there's a word boundary between the e and its accent).
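The case-sensitivity point can be checked with the two-grep version from the start of the answer. This is a sketch assuming GNU grep, where -i also makes the back-reference comparison case-insensitive; the function names are hypothetical.

```shell
# Case-sensitive: "B" and "b" count as different characters.
unique10_cs() {
  grep -Eow '\w{10}' | grep -v '\(.\).*\1'
}

# Case-insensitive: -i makes the back-reference treat "B" and "b" as equal.
unique10_ci() {
  grep -Eow '\w{10}' | grep -iv '\(.\).*\1'
}

printf '%s\n' 'Babylonish' | unique10_cs
printf '%s\n' 'Babylonish' | unique10_ci || echo 'no match when ignoring case'
```

The first command prints Babylonish; the second finds nothing to print, so the fallback message is shown instead.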

Related Solutions

Number of Backslashes Needed for Escaping Regex Backslash on Command-Line

For the unquoted example, each \\ pair passes one backslash to grep, so 4 backslashes pass two to grep, which translates to a single backslash. 6 backslashes pass three to grep, translating to one backslash and one \c, which is equal to c. One additional backslash does not change anything, because it is translated \c -> c by the shell. Eight backslashes in the shell are four in grep, translated to two, so this does not match anymore.

For the example in double quotes, note what follows your second quote from the bash manpage:

The backslash retains its special meaning only when followed by one of the following characters: $, `, ", \, or newline.

I.e. when you give an odd number of backslashes, the sequence ends in \c, which would be equal to c in the unquoted case, but when quoted, the backslash loses its special meaning, so \c is passed to grep. That is why the range of "possible" backslashes (i.e. those that make up a pattern matching your example file) slides down by one.
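The shell's share of this processing can be observed without grep by printing the argument the shell actually passes on. A minimal sketch, where show_arg is a hypothetical helper:

```shell
# Print the first argument exactly as the command receives it.
show_arg() {
  printf '%s\n' "$1"
}

show_arg \\\\c    # unquoted, four backslashes: the command receives \\c
show_arg \c       # unquoted, one backslash:   the command receives c
show_arg "\c"     # double-quoted:             the command receives \c
show_arg "\\c"    # double-quoted:             the command receives \c
```

Whatever the command receives is then interpreted a second time by grep's own regex syntax, which is where the remaining halving of backslashes happens.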

Linux – Grep/awk/sed for lines composed of only two letters, and lines that start with a letter and meet a certain length

Give pcregrep (grep with Perl-compatible regular expressions) a try:

To remove sequences made up of only two letters:

pcregrep -Mv '>.*\n([ACGT])\1*([ACGT])\2*(\1|\2)*$' file

output:

>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
>NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
GAATA

To remove sequences of 5 letters or fewer:

pcregrep -Mv '>.*\n[ACGT]{1,5}$' file

output:

>NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
GGGGGGGAAAAAAA
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
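To see what the sequence part of each pattern accepts, the following sketch tests single sequences with GNU grep -P instead of pcregrep -M. It looks at the sequence line only, not at header/sequence pairs, and the function names are hypothetical.

```shell
# Succeeds when the sequence has at least two characters and uses
# at most two distinct letters.
is_two_letter_seq() {
  printf '%s\n' "$1" | grep -Pxq '([ACGT])\1*([ACGT])\2*(\1|\2)*'
}

# Succeeds when the sequence is 1 to 5 letters long.
is_short_seq() {
  printf '%s\n' "$1" | grep -Pxq '[ACGT]{1,5}'
}

for seq in GGGGGGGAAAAAAA CAGCCC GAATA; do
  if is_two_letter_seq "$seq"; then echo "$seq: two letters only"; fi
  if is_short_seq "$seq"; then echo "$seq: 5 letters or fewer"; fi
done
```

This reports GGGGGGGAAAAAAA as a two-letter sequence and GAATA as a short one, while CAGCCC passes both filters, in line with the two outputs shown above.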
