Matching string with a fixed number of characters using grep

grep, regular expression

I am trying to find all 6 letter words using grep. I currently have this:

grep "^.\{6\}$" myfile.txt

However, I am finding that I am also getting results such as: étuis, étude.

I suspect it has something to do with the accents above the e in those words.

Is there something I can do to ensure that this does not happen?

Thanks for your help!

Best Answer

grep's idea of a character is locale-dependent. If you're in a non-Unicode locale and you grep from a file with Unicode characters in it then the character counts won't match up. If you echo $LANG then you'll see the locale you're in.

If you set the LC_CTYPE and/or LANG environment variables to a value ending with ".UTF-8" then you will get the right behaviour:

$ cat data
étuis
letter
éééééé
$ LANG=C grep -E '^.{6}$' data
étuis
letter
$ LANG=en_US.UTF-8 grep -E '^.{6}$' data
letter
éééééé
$

You can change your locale for just a single command by assigning the variable on the same line as the command.
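As a rough sketch of that scoping, the snippet below checks that a prefix assignment reaches only the one command it precedes. The variable names before, inside and after are made up for this illustration. Keep in mind that LC_ALL, when set, overrides both LANG and LC_CTYPE, so if a LANG=... prefix seems to have no effect, LC_ALL=... is worth trying instead.

```shell
# Illustrative variable names; they are not part of the original answer.
before=${LANG-}

# The assignment is placed in the environment of this one command only.
inside=$(LANG=C sh -c 'printf "%s" "$LANG"')

# The calling shell's own LANG is left untouched.
after=${LANG-}

printf 'inside the command: %s\n' "$inside"
[ "$before" = "$after" ] && echo "LANG in the calling shell is unchanged"
```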

With this configuration, multi-byte characters are considered as single characters. If you want to exclude non-ASCII characters entirely, some of the other answers have solutions for you.
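One possible way to exclude non-ASCII characters, sketched here as an assumption and not taken from those other answers, is to force the C locale and match on a letter class instead of ".". The file contents and the variable names data and ascii_six are invented for the example.

```shell
# Hypothetical sample file; the name and contents are assumptions for illustration.
data=$(mktemp)
printf '%s\n' 'étuis' 'letter' 'éééééé' 'abc123' > "$data"

# In the C locale grep works byte by byte and [[:alpha:]] covers only A-Z and a-z.
# -x requires the whole line to match, which replaces the ^...$ anchors.
ascii_six=$(LC_ALL=C grep -x '[[:alpha:]]\{6\}' "$data")

printf '%s\n' "$ascii_six"
rm -f "$data"
```

With this sample only "letter" should be printed: the accented lines contain bytes outside the ASCII letters, and "abc123" contains digits.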

Note that things can still break, or at least not do exactly what you expect, in the presence of combining characters. Your grep may treat LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT differently from LATIN SMALL LETTER E WITH ACUTE.
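To make the difference concrete, here is a small sketch that writes "étude" in both spellings using octal byte escapes and counts the bytes. The variable names are illustrative. In a UTF-8 locale the precomposed spelling is 5 characters and the decomposed one is 6, so a pattern such as ^.{6}$ would likely match the decomposed spelling even though it displays as five letters.

```shell
# Precomposed: U+00E9 is the two bytes C3 A9 in UTF-8, followed by "tude".
precomposed_bytes=$(printf '\303\251tude' | wc -c | tr -d ' ')

# Decomposed: plain "e" followed by U+0301 (bytes CC 81), then "tude".
decomposed_bytes=$(printf 'e\314\201tude' | wc -c | tr -d ' ')

printf 'precomposed: %s bytes\n' "$precomposed_bytes"
printf 'decomposed:  %s bytes\n' "$decomposed_bytes"
```

The 6-byte count for the precomposed spelling is also why étude matched the six-"character" pattern in the C locale.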

Related Solutions

Grep ‘OR’ regex problem

In basic regular expressions, which grep uses by default, the characters (, | and ) are literal unless they are escaped with a backslash. So you should use

$ grep "^ID.*\(ETS\|FBS\)" my_file.txt

You don't need the escapes when you use the extended regex (-E) option. See man grep, section "Basic vs Extended Regular Expressions".
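For comparison, the same search written with -E might look like the sketch below. The sample lines and the variable names sample and matches are invented for illustration; only the pattern comes from the answer.

```shell
# Hypothetical sample data standing in for my_file.txt.
sample=$(mktemp)
printf '%s\n' 'ID 001 ETS' 'ID 002 FBS' 'ID 003 XYZ' 'NO 004 ETS' > "$sample"

# With -E the grouping parentheses and the | alternation need no backslashes.
matches=$(grep -E '^ID.*(ETS|FBS)' "$sample")

printf '%s\n' "$matches"
rm -f "$sample"
```

Only the two lines that start with ID and contain ETS or FBS should be printed.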

Invalid back reference using grep

No, it's not correct. I have no idea what the \1{3} is supposed to be but that's what is causing you problems. If you want to find lines that contain three repeated characters followed by three other repeated characters, you can use this:

grep -E '([a-z])\1{2}([a-z])\2{2}'

The \1 refers to the first captured group. You can capture groups by using parentheses. Then, \1 is the 1st such group and \2 is the second and so on. Since you had no captured groups, grep was complaining about an invalid reference since it had nothing to refer to. So, in the regex above, the parentheses are capturing the two groups. Then, you want {2} and not {3} since the initial match is also counted.

You don't specify whether you need the match to be a whole word or whether you also want to match within words. If you want the entire word to match (and exclude things like aaaabbb), use this instead:

grep -wE '([a-z])\1{2}([a-z])\2{2}'

To print only the matched portion of the line (the word) and not the entire line, use (GNU grep only):

grep -owE '([a-z])\1{2}([a-z])\2{2}'
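A quick way to see how the three variants differ is to run them on a few test lines. This is only a sketch: the sample lines and variable names are made up, and it assumes a grep that accepts back-references with -E and supports -o, as GNU grep does.

```shell
# Hypothetical test lines chosen to exercise each variant.
sample=$(mktemp)
printf '%s\n' 'aaabbb' 'xaaaabbb' 'abcabc' 'say aaabbb now' > "$sample"

pattern='([a-z])\1{2}([a-z])\2{2}'

# Matches anywhere in the line, including inside the longer word xaaaabbb.
anywhere=$(grep -E "$pattern" "$sample")

# -w: the match must be delimited by non-word characters or line boundaries.
whole_word=$(grep -wE "$pattern" "$sample")

# -o: print only the matched text, one match per output line.
only_match=$(grep -owE "$pattern" "$sample")

printf 'anywhere:\n%s\n\nwhole word:\n%s\n\nmatch only:\n%s\n' \
    "$anywhere" "$whole_word" "$only_match"
rm -f "$sample"
```

The line abcabc never matches, and xaaaabbb drops out once -w is added because its match is preceded by another letter.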
