I am trying to find all 6-letter words using grep. I currently have this:
grep "^.\{6\}$" myfile.txt
However, I am finding that I am also getting results such as étuis and étude.
I suspect it has something to do with the symbols above the e in those words.
Is there something I can do to ensure that this does not happen?
Thanks for your help!
Best Answer
grep's idea of a character is locale-dependent. If you're in a non-Unicode locale and you grep a file with Unicode characters in it, then the character counts won't match up. In UTF-8, é is encoded as two bytes, so a byte-oriented locale counts étuis as six characters. If you run echo $LANG then you'll see the locale you're in.
If you set the LC_CTYPE and/or LANG environment variables to a value ending with ".UTF-8" then you will get the right behaviour. You can change your locale for just a single command by assigning the variable on the same line as the command.
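As a rough sketch of the per-command assignment, something like the following could be used. The file name words.txt and the sample words are made up for illustration, and the script assumes that `locale -a` is available and lists at least one UTF-8 locale. It sets LC_ALL rather than LC_CTYPE because LC_ALL overrides both LC_CTYPE and LANG if it happens to be set already.

```shell
# \303\251 is the UTF-8 encoding of "é", written as octal escapes so the
# sample file has the same bytes however this script is saved.
# words.txt is a hypothetical file name used only for this example.
printf '\303\251tuis\n\303\251tudes\nbanana\n' > words.txt

# Show the locale currently in effect.
echo "LANG=${LANG:-<unset>}"

# Assumption: `locale -a` exists and lists at least one UTF-8 locale.
# `|| true` keeps an empty result from aborting the script under `set -e`.
utf8_locale=$(locale -a 2>/dev/null | grep -i -m 1 'utf-*8$' || true)

if [ -n "$utf8_locale" ]; then
    # The assignment applies to this one grep invocation only.
    utf8_matches=$(LC_ALL="$utf8_locale" grep '^.\{6\}$' words.txt || true)
    printf '%s\n' "$utf8_matches"
else
    echo "no UTF-8 locale found; install one or adjust the name" >&2
fi
```

In a UTF-8 locale the five-letter étuis should no longer be reported, while the six-letter études and banana are.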
With this configuration, multi-byte characters are considered as single characters. If you want to exclude non-ASCII characters entirely, some of the other answers have solutions for you.
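One possible way to restrict matches to ASCII letters only is sketched below. This is not necessarily what the other answers propose; it assumes that "word" means the letters A to Z in either case, and the sample input is invented for illustration.

```shell
# In the C locale the range A-Za-z covers ASCII letters only, so any line
# containing an accented letter (or a digit) cannot match.
# -x requires the whole line to match, like anchoring with ^ and $.
ascii_matches=$(printf '\303\251tuis\n\303\251tudes\nbanana\nsix666\n' |
    LC_ALL=C grep -x '[A-Za-z]\{6\}')
printf '%s\n' "$ascii_matches"
```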
Note that it's still possible for things to break, or at least not do exactly what you expect, in the presence of combining characters. Your grep may treat LATIN SMALL LETTER E + COMBINING ACUTE ACCENT differently than LATIN SMALL LETTER E WITH ACUTE.
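To see whether this affects a particular grep, a small experiment along these lines may help. The file names are hypothetical, and the script again assumes `locale -a` lists a UTF-8 locale. With GNU grep I would expect the decomposed spelling to count as six characters, since . matches one code point at a time, but other implementations may behave differently.

```shell
# "étuis" spelled two ways, as raw UTF-8 bytes in octal:
#   precomposed: U+00E9 (\303\251) followed by "tuis"
#   decomposed:  "e" + U+0301 COMBINING ACUTE ACCENT (\314\201), then "tuis"
printf '\303\251tuis\n'  > precomposed.txt
printf 'e\314\201tuis\n' > decomposed.txt

# Assumption: `locale -a` lists at least one UTF-8 locale on this system.
utf8_locale=$(locale -a 2>/dev/null | grep -i -m 1 'utf-*8$' || true)

if [ -n "$utf8_locale" ]; then
    # -c prints the number of matching lines; `|| true` keeps a zero count
    # from being treated as a failure under `set -e`.
    precomposed_hits=$(LC_ALL="$utf8_locale" grep -c '^.\{6\}$' precomposed.txt || true)
    decomposed_hits=$(LC_ALL="$utf8_locale" grep -c '^.\{6\}$' decomposed.txt || true)
    echo "precomposed: $precomposed_hits, decomposed: $decomposed_hits"
else
    echo "no UTF-8 locale found; install one or adjust the name" >&2
fi
```

If the two counts differ, the input may need to be normalized to a single form before filtering by length.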