How to display Chinese characters correctly on a remote Red Hat machine

character encoding, input-method, unicode

I am using Ubuntu 14.04 to connect to a remote host whose kernel version is:

Linux version 2.6.32-431.11.5.el6.yyyzzz.x86_64 (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jul 3 09:42:34 CST 2014

The files I upload to that machine do not display Chinese characters correctly.
When I open a file there and type some random Chinese characters with the Ubuntu ibus input method, it shows:

~R~V�~K~B~I~W个~I~N~T�饭~T~E

I searched online and tried the following 2 methods:

1: examine the locale

It shows:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

That seems fine.

2: install Chinese Language support package

I did:

yum install "@Chinese Support"

It installed 178M of files on that machine.

After that, I opened another file and tried typing some Chinese with ibus, but the problem remains. How can I solve it?

Update 1
I did some more research afterwards. I found that some characters can be typed correctly (via the Pinyin input method in ibus), for example:

起 度 顿 客

They all correspond to the Pinyin I typed, but an automatically generated space (not typed by me) follows each character.

If I try to type 启，杜，盾，刻 (which have the same Pinyin as the four characters above), I get:

�~P�~]~\ ~[� ~H�

In my experience, if the encoding conversion is completely broken, typing Pinyin produces weird characters that look like Chinese but are not, and they never correspond to the Pinyin I typed.

This time things are a little different: I can type some characters correctly (with a system-generated space after each), while others are indecipherable.

Best Answer

Basically, this may be a mismatch between your locale, which is set to UTF-8, and the encoding of your Chinese file, which may be gbk, gb2312, gb18030, or Big-5.

All of the encodings listed above are incompatible with UTF-8 for non-ASCII characters.

Now, let's assume the file is encoded in gbk. When you display its contents, the gbk bytes are interpreted as UTF-8, which produces the gibberish.
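The mismatch can be reproduced in a few lines. This is only a minimal sketch, assuming an iconv that knows GBK (glibc's does); the helper name is_valid_utf8 and the temporary sample file are made up for illustration.

```shell
# is_valid_utf8 FILE: succeeds only if FILE decodes cleanly as UTF-8.
# (Hypothetical helper; it asks iconv to read the file as UTF-8 and discards the output.)
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Save four characters from the question as GBK bytes in a throwaway file.
sample=$(mktemp)
printf '%s\n' '启杜盾刻' | iconv -f UTF-8 -t GBK > "$sample"

# A UTF-8 terminal interprets these bytes as UTF-8, which they are not.
if is_valid_utf8 "$sample"; then
  echo "decodes as UTF-8"
else
  echo "not valid UTF-8: a UTF-8 terminal will show gibberish"
fi

# Decoding with the encoding the file was actually written in recovers the text.
iconv -f GBK -t UTF-8 "$sample"
```

The bytes in the file never change; only the charset used to interpret them decides whether the result is readable.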

Here comes the solution.

Use luit. (Preferred)

$ whatis luit
luit (1)             - Locale and ISO 2022 support for Unicode terminals

$ luit -encoding gbk cat a_chinese_file.txt
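luit translates on the fly every time the file is shown. If a permanent UTF-8 copy is acceptable, a one-time conversion with iconv is another option; note that this is a different tool from the luit approach recommended here, not part of it. The sketch below assumes an iconv that supports GB18030, and the function name to_utf8 and the output file name are hypothetical.

```shell
# to_utf8 SRC DST [ENCODING]: write a UTF-8 copy of SRC to DST.
# ENCODING defaults to GB18030, a broad guess for Simplified Chinese text.
# The output is staged in DST.tmp so a failed conversion leaves no partial DST.
to_utf8() {
  src=$1 dst=$2 enc=${3:-GB18030}
  if iconv -f "$enc" -t UTF-8 "$src" > "$dst.tmp" 2>/dev/null; then
    mv "$dst.tmp" "$dst"
  else
    rm -f "$dst.tmp"
    echo "cannot decode $src as $enc" >&2
    return 1
  fi
}

# Example call (file names are placeholders):
#   to_utf8 a_chinese_file.txt a_chinese_file.utf8.txt
#   to_utf8 a_chinese_file.txt a_chinese_file.utf8.txt BIG5
```

A clean exit only means the bytes were decodable in the chosen encoding, so it is still worth reading the result to confirm the guess was right.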

Most (if not all) encodings in use are compatible with ASCII, so if you only need ASCII plus one other encoding, you can use either of the following two methods.

Change the encoding of your terminal

You may consider this, since it does not require any additional package to be installed.

Change your locale

I think this requires the corresponding locale to be installed, though.
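Whether a suitable locale is installed can be checked before switching. A minimal sketch, assuming the `locale` utility is available and that names appear in the glibc style (zh_CN.gbk, zh_CN.gb18030); has_locale is a hypothetical helper, and the candidate names are guesses that may differ on your system.

```shell
# has_locale NAME: succeeds if `locale -a` lists NAME exactly (case-insensitive).
has_locale() {
  locale -a 2>/dev/null | grep -qixF -- "$1"
}

# Candidate names in the form glibc usually prints them; adjust as needed.
for loc in zh_CN.gbk zh_CN.gb18030; do
  if has_locale "$loc"; then
    echo "$loc is installed"
  else
    echo "$loc is not installed"
  fi
done
```

The locale only tells programs which encoding to assume. The terminal emulator most likely has to be set to the same encoding as well for the output to display correctly.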

Some details about the Chinese encodings mentioned above.

gbk, gb2312, gb18030 are encodings for Simplified Chinese.

If you are not sure which encoding your file uses, assume it is gb18030.

In terms of the number of characters covered: gb18030 > gbk > gb2312, and each larger encoding is a superset of the smaller ones.

Big-5 is the encoding for Traditional Chinese.

What's more, the encoding for Simplified Chinese is sometimes referred to as CP936 (Code Page 936; I think this name comes from Windows, where it corresponds to gbk).
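The ordering gb18030 > gbk > gb2312 can be probed by asking iconv whether a character is representable in each encoding. This is a sketch that assumes an iconv which exits non-zero on unrepresentable characters (glibc's does; some other implementations substitute a placeholder instead). fits_in is a hypothetical helper and the three test characters are my own picks.

```shell
# fits_in ENCODING TEXT: succeeds if every character of TEXT exists in ENCODING.
fits_in() {
  printf '%s' "$2" | iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

# A common simplified character, a traditional character, and an emoji.
for enc in GB2312 GBK GB18030; do
  for ch in '启' '們' '😀'; do
    if fits_in "$enc" "$ch"; then result=yes; else result=no; fi
    echo "$enc $ch $result"
  done
done
```

If the superset relation holds, 启 should fit in all three encodings, 們 only in gbk and gb18030, and the emoji only in gb18030. This is also why gb18030 is the safest guess for decoding.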

Related Solutions

Shell – bulk rename (or correctly display) files with special characters

I guess you see this � invalid character because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide on what encoding to use. Nowadays, there is a trend to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.

Try LC_CTYPE=en_US.iso88591 ls to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here.

In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:

grep-invalid-utf8 () {
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8

You can check if they make more sense in another locale with recode or iconv:

find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8

Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_=encode("utf8", $_)'

This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would be doing without actually renaming the files.

Mutt: how to display emoji characters correctly

(I don't expect my comments to be answered since the question was posted last May, and the user hasn't been active since, but this might be useful to others.)

Make sure that the display charset in your muttrc is set to "utf-8", as follows:

set charset = "utf-8"
