Regex Pattern – Regex for All 10 Letter Words with Unique Letters

grep, regular expression

I am trying to write a regex that matches all words that are 10 characters long and in which no letter is repeated.

So far, I have got

grep --colour -Eow '(\w{10})'

which covers only the first part of the problem (the length). How would I go about checking for the "uniqueness"? I really don't have a clue, apart from knowing that I need to use back references.

Best Answer

grep -Eow '\w{10}' | grep -v '\(.\).*\1'

excludes words in which any character appears more than once.

grep -Eow '\w{10}' | grep -v '\(.\)\1'

excludes only the ones in which the same character appears twice in a row.
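To make the difference between the two filters concrete, here is a small sketch on made-up sample input. The helper names no_duplicates and no_adjacent_repeats are hypothetical wrappers around the two pipelines above, and \w inside -E is a GNU grep extension.

```shell
# Hypothetical helpers wrapping the two pipelines above.

# Keep 10-character words in which no character occurs twice anywhere.
no_duplicates() {
    grep -Eow '\w{10}' | grep -v '\(.\).*\1'
}

# Keep 10-character words in which no character is immediately repeated.
no_adjacent_repeats() {
    grep -Eow '\w{10}' | grep -v '\(.\)\1'
}

sample='background bookkeeper television'

# "television" repeats e and i, but never side by side.
printf '%s\n' "$sample" | no_duplicates        # prints: background
printf '%s\n' "$sample" | no_adjacent_repeats  # prints: background, television
```

Only the first filter answers the question as asked; the second lets through any word whose repeated letters are separated by other letters.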

POSIXly:

tr -cs '[:alnum:]_' '[\n*]' |
   grep -xE '.{10}' |
   grep -v '\(.\).*\1'

tr puts words on their own line by converting any sequence of non-word-characters (complement of alpha-numeric and underscore) to a newline character.
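As a quick illustration of that splitting step on its own, the following sketch feeds a made-up sentence through the same tr command (split_words is a hypothetical helper name; the '[\n*]' form is assumed to be supported by your tr, as it is in GNU tr).

```shell
# Hypothetical helper: emit one word per line, exactly as the tr call above does.
split_words() {
    # -c: complement the set (everything that is NOT alnum or underscore)
    # -s: squeeze runs of the replacement into a single newline
    tr -cs '[:alnum:]_' '[\n*]'
}

printf 'hello, world! foo_bar\n' | split_words
# prints:
# hello
# world
# foo_bar
```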

Or with one grep:

tr -cs '[:alnum:]_' '[\n*]' |
   grep -ve '^.\{0,9\}$' -e '.\{11\}' -e '\(.\).*\1'

(exclude lines of fewer than 10 or more than 10 characters, and those in which a character appears at least twice).

With one grep only (GNU grep with PCRE support or pcregrep):

grep -Po '\b(?:(\w)(?!\w*\1)){10}\b'

That is, a word boundary (\b), followed by a sequence of 10 word characters, each of which must not be followed by a sequence of word characters and then itself again (enforced with the negative look-ahead PCRE operator (?!...)), followed by another word boundary.

We're lucky that it works here, as not many regexp engines work with backreferences inside repeating parts.
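A short sketch of the PCRE version on made-up input, assuming a grep built with PCRE support (-P); pcre_unique10 is a hypothetical helper name. The sample also shows that digits and underscores count as word characters.

```shell
# Hypothetical wrapper; needs a grep with -P (PCRE) support, e.g. GNU grep.
pcre_unique10() {
    # Each of the 10 word characters is captured as \1 and then checked:
    # the negative look-ahead fails if the same character occurs later
    # in the word.
    grep -Po '\b(?:(\w)(?!\w*\1)){10}\b'
}

printf 'background abc_123xyz bookkeeper backgrounds\n' | pcre_unique10
# prints:
# background
# abc_123xyz
```

"backgrounds" has no repeated letter but is 11 characters long, so the trailing \b rejects it.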

Note that (with my version of GNU grep at least)

grep -Pow '(?:(\w)(?!\w*\1)){10}'

Doesn't work, but

grep -Pow '(?:(\w)(?!\w*\2)){10}'

does (as does echo aa | grep -Pw '(.)\2'), which looks like a bug.

You may want:

grep -Po '(*UCP)\b(?:(\w)(?!\w*\1)){10}\b'

if you want \w or \b to consider any letter as a word component and not just the ASCII ones in non-ASCII locales.

Another alternative:

grep -Po '\b(?!\w*(\w)\w*\1)\w{10}\b'

That is, a word boundary (one that is not followed by a sequence of word characters in which one character repeats), followed by 10 word characters and another word boundary.
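As a sanity check, the two PCRE patterns can be run side by side on the same made-up input; this sketch again assumes a grep with -P support, and the variable names are purely illustrative.

```shell
# The two PCRE patterns from this answer (variable names are illustrative).
per_char='\b(?:(\w)(?!\w*\1)){10}\b'   # look-ahead after every character
up_front='\b(?!\w*(\w)\w*\1)\w{10}\b'  # one look-ahead at the word start

sample='background television backgrounds trampoline'

printf '%s\n' "$sample" | grep -Po "$per_char"
printf '%s\n' "$sample" | grep -Po "$up_front"
# each command prints:
# background
# trampoline
```

The second form checks the whole word for a repeat once, before consuming anything, which some readers may find easier to follow than a back-reference inside a repeated group.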

Things to possibly have at the back of one's mind:

  • Comparison is case sensitive, so Babylonish for instance would be matched, since all the characters are different even though there are two Bs, one lower and one upper case (use -i to change that).
  • For -w, \w and \b, a word character is a letter (ASCII ones only for GNU grep for now, the [:alpha:] character class in your locale if using -P and (*UCP)), a decimal digit or an underscore.
  • That means that c'est (two words as per the French definition of a word) or it's (one word according to some English definitions of a word) or rendez-vous (one word as per the French definition of a word) are not considered one word.
  • Even with (*UCP), Unicode combining characters are not considered as word components, so téléphone ($'t\u00e9le\u0301phone') is considered as 10 characters, one of which is non-alpha. défavorisé ($'d\u00e9favorise\u0301') would be matched even though it's got two é, because that's 10 all different alpha characters followed by a combining acute accent (non-alpha, so there's a word boundary between the e and its accent).
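The case-sensitivity point from the first item can be seen with a short sketch, using the Babylonish example from the list and the back-reference filter from the start of this answer (the "rejected" message is only there for illustration).

```shell
word=Babylonish

# Case-sensitive: B and b are different characters, so the word is kept.
printf '%s\n' "$word" | grep -v '\(.\).*\1'
# prints: Babylonish

# With -i, B and b count as the same character, so the word is filtered out
# and grep prints nothing (the echo is an illustrative fallback message).
printf '%s\n' "$word" | grep -iv '\(.\).*\1' || echo 'rejected with -i'
# prints: rejected with -i
```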