Why does grep output lines that seemingly don't match the expression?


As mentioned in my comment, this behaviour may be caused by a bug.

I am aware that different locales affect character order, but I thought the -o output below confirmed that this was not the problem here. I was wrong: adding LC_ALL=C gives the expected output.

I asked this question after I saw that the locale affected the output.

[aa@bb grep-test]$ cat input.txt
aa bb
CC cc
dd ee

[aa@bb grep-test]$ LC_ALL=C grep -o [A-Z] input.txt
C
C
[aa@bb grep-test]$ grep -o [A-Z] input.txt
C
C
[aa@bb grep-test]$ LC_ALL=C grep [A-Z] input.txt
CC cc
[aa@bb grep-test]$ grep [A-Z] input.txt
aa bb
CC cc
dd ee
[aa@bb grep-test]$

[aa@bb tmp]$ cat test
aa bb
CC cc
dd ee

[aa@bb tmp]$ grep [A-Z] test
aa bb
CC cc
dd ee
[aa@bb tmp]$ grep -o [A-Z] test
C
C
[aa@bb tmp]$ grep -E [A-Z] test
aa bb
CC cc
dd ee
[aa@bb tmp]$ grep -n [A-Z] test
1:aa bb
2:CC cc
3:dd ee
[aa@bb tmp]$ echo [A-Z]
[A-Z]
[aa@bb tmp]$ grep -V
GNU grep 2.6.3
...
[aa@bb tmp]$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
...
[aa@bb grep-test]$ command -v grep
/bin/grep
[aa@bb grep-test]$ rpm -q -f $(command -v grep)
grep-2.6.3-6.el6.x86_64
[aa@bb grep-test]$ echo grep [A-Z] input.txt | xxd
0000000: 6772 6570 205b 412d 5a5d 2069 6e70 7574  grep [A-Z] input
0000010: 2e74 7874 0a                             .txt.    
[aa@bb grep-test]$ cmd='grep [A-Z] input.txt'; echo $cmd | xxd; eval $cmd
0000000: 6772 6570 205b 412d 5a5d 2069 6e70 7574  grep [A-Z] input
0000010: 2e74 7874 0a                             .txt.
aa bb
CC cc
dd ee
[aa@bb grep-test]$ xxd input.txt
0000000: 6161 2062 620a 4343 2063 630a 6464 2065  aa bb.CC cc.dd e
0000010: 650a 0a                                  e..
[aa@bb grep-test]$

Best Answer

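This looks like your locale's collation rules being very ... helpful. In many non-C locales, lowercase and uppercase letters are interleaved in the collation order (roughly aAbBcC...), so a range such as [A-Z] can also cover lowercase letters.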

Try it with

LC_ALL=C grep '[A-Z]' input.txt

to test that idea.
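For anyone who wants to reproduce the check without touching their own files, here is a minimal sketch. It assumes a POSIX shell with mktemp and grep available; the variable names (tmpfile, c_locale_out, upper_class_out) are made up for illustration. It also tries the POSIX character class [[:upper:]], which names the uppercase letters directly instead of relying on a collation range.

```shell
#!/bin/sh
# Recreate the three-line input from the question in a temporary file
# (tmpfile is a hypothetical name used only in this sketch).
tmpfile=$(mktemp)
printf 'aa bb\nCC cc\ndd ee\n' > "$tmpfile"

# Range expression evaluated in the C locale: A-Z spans only ASCII uppercase.
# The pattern is quoted so the shell cannot expand it as a filename glob.
c_locale_out=$(LC_ALL=C grep '[A-Z]' "$tmpfile")

# POSIX character class: selects uppercase letters without using a range,
# so it does not depend on where letters fall in the collation order.
upper_class_out=$(grep '[[:upper:]]' "$tmpfile")

printf 'LC_ALL=C [A-Z]: %s\n' "$c_locale_out"
printf '[[:upper:]]:    %s\n' "$upper_class_out"

rm -f "$tmpfile"
```

If both commands print only the "CC cc" line while a plain grep '[A-Z]' prints all three lines, the locale's collation order is the likely cause.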

I have

export LANG=en_US.UTF-8
export LC_COLLATE=C
export LC_NUMERIC=C

in my shell startup files to avoid this kind of trouble while still getting my Unicode goodness.
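One thing to keep in mind with this setup: LC_ALL, when set, overrides every individual LC_* variable, which in turn override LANG. The sketch below assumes a POSIX shell and uses made-up names (tmpfile, collate_out). It unsets LC_ALL first so that LC_COLLATE=C can take effect. Whether en_US.UTF-8 is installed on a given machine is also an assumption; if it is missing, the C library typically falls back to the C locale.

```shell
#!/bin/sh
# Temporary copy of the question's input (hypothetical file name).
tmpfile=$(mktemp)
printf 'aa bb\nCC cc\ndd ee\n' > "$tmpfile"

# LC_ALL would override LC_COLLATE, so make sure it is not set.
unset LC_ALL

# Settings from the answer: UTF-8 for everything except collation.
export LANG=en_US.UTF-8
export LC_COLLATE=C

# With C collation, the range A-Z should no longer pick up lowercase letters.
collate_out=$(grep '[A-Z]' "$tmpfile")
printf 'LANG=%s LC_COLLATE=%s: %s\n' "$LANG" "$LC_COLLATE" "$collate_out"

rm -f "$tmpfile"
```

Running the locale command afterwards shows which value each category actually ended up with.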
