Ubuntu – Conflict between variable substitution and CJK characters in BASH

bash, lubuntu, lxterminal, utf-8

I encountered a problem with variable substitution in the BASH shell.
Say you define a variable a. Then the command

    $> echo ${a//[0-4]/}

prints its value with every digit from 0 to 4 removed:

    $> a="Hello1265-3World"
    $> echo ${a//[0-4]/}
    Hello65-World

This seems to work just fine, but let's take a look at the next example:

    $> b="你1265-3好"
    $> echo ${b//[0-4]/}
    你1265-3好

The substitution did not take place; I assume this is because b contains CJK characters. The issue affects every pattern that uses square brackets. Surprisingly, variable substitution without square brackets works fine in both cases:

    $> a="Hello1265-3World"
    $> echo ${a//2/}
    Hello165-3World
    $> b="你1265-3好"
    $> echo ${b//2/}
    你165-3好

Is it a bug or am I missing something?

I use Lubuntu 12.04, terminal is lxterminal and echo $BASH_VERSION returns 4.2.24(1)-release.

EDIT: Andrew Johnson stated in a comment that the command works fine with gnome-terminal and bash 4.2.37(1)-release. I wonder whether the problem lies with lxterminal or with the specific 4.2.24(1)-release version of bash.

EDIT: I tried it with gnome-terminal on Lubuntu 12.04 but the problem is still there…

Best Answer

Short answer:

Set LC_ALL=C to get the behaviour you expect:

    paul@permafrost:~$ b="你1265-3好"
    paul@permafrost:~$ echo ${b//[0-2]/}
    你1265-3好
    paul@permafrost:~$ export LC_ALL=C
    paul@permafrost:~$ echo ${b//[0-2]/}
    你65-3好
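Exporting LC_ALL=C changes the locale for the rest of the session, which may not be what you want. As a minimal sketch, the change can be confined to a single function by declaring LC_ALL as a local variable. The function name strip_digit_range is hypothetical and only serves as an illustration; the sketch assumes bash re-reads the locale when a local LC_ALL is assigned and restores it when the function returns.

```shell
# Hypothetical helper: remove the digits 0-4 from its first argument.
# LC_ALL is declared local, so the C locale applies only inside the function.
strip_digit_range() {
    local LC_ALL=C
    printf '%s\n' "${1//[0-4]/}"
}

b="你1265-3好"
strip_digit_range "$b"    # expected: 你65-好
```

In the C locale the pattern is matched byte by byte. The bytes of a UTF-8 encoded CJK character all lie above the ASCII range, so [0-4] cannot match part of such a character by accident.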

Long answer:

The behaviour you expect relies on collation order, which depends on the locale and on the OS implementation. The POSIX standard explicitly leaves the meaning of range expressions unspecified in every locale except the C (POSIX) locale. (Bash calls an external library for this and, at a guess, that library falls back to ASCII ordering when only ASCII characters are present.)

Later versions of bash (4.3 and newer) have a shell option, globasciiranges, that makes range expressions in bracket patterns use C-locale ordering, which gives the behaviour you expect.
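One such option is globasciiranges, available in bash 4.3 and later. The following is a minimal sketch of how it might be used. The guard around shopt is an addition for this example, so that an older bash, such as the 4.2.24 from the question, reports that the option is missing instead of failing silently.

```shell
b="你1265-3好"

# shopt returns a non-zero status if this bash does not know the option.
if shopt -s globasciiranges 2>/dev/null; then
    # Ranges such as [0-4] are now matched using C-locale ordering,
    # while the rest of the session keeps its UTF-8 locale.
    result=${b//[0-4]/}
    printf '%s\n' "$result"
else
    echo "globasciiranges is not available in this bash" >&2
fi
```

Unlike LC_ALL=C, this affects only range expressions in patterns, so other locale-dependent behaviour is left unchanged.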

See:

https://groups.google.com/forum/#!topic/gnu.bash.bug/S6cN9KI4vK4/discussion

for more background.
