Vim: How to handle Unicode files with text in multiple (more than two) languages

unicode vim

What settings do I need to set in Vim/gVim to be able to view Unicode text files which have text in many languages?

You may make these assumptions:

The number of languages is more than two.
Some of the languages are Chinese, Japanese, and Korean.
It is enough if I can view these files in gVim (not necessarily Vim).
gVim 7.0 running on Windows.

Here is a text sample, which when saved in Unicode opens fine in Notepad, but shows up as gibberish in gVim:

This is English.
这是中文。
これは日本です。
한국입니다.
ಇದು ಕನ್ನಡ.

Best Answer

Using gVim on Windows, I did the following two things:

:set encoding=utf-8
:set guifont=*

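The second command brings up a font picker. By choosing the font "@MS Mincho", I got some of the Japanese characters to display, but oddly they were rotated 90 degrees to the left. (On Windows, font names that start with "@" are the vertical-writing variants, which would explain the rotation; the same font without the "@" prefix, if it is listed, should display the characters upright.)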

Anyway, you'll have to set the encoding before loading or pasting text into gVim (otherwise it might just convert the characters to question marks). Then you'll have to find a font that (a) is fixed width and (b) includes the characters you want to see. I don't seem to have such a font on my system at the moment, but you may.

Related Solutions

Unicode, Unicode Big Endian or UTF-8? What is the difference? Which format is better?

Dunno. Which is better: a saw or a hammer? :-)

Unicode isn't UTF

One passage in the article is more relevant to the subject at hand, though:

UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and all ASCII characters fit in 1 byte). As Joel puts it:

“Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping the Unicode code point directly to 4 bytes. Obviously, it’s not very size-efficient.
UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes (a surrogate pair) per character to represent characters outside the Basic Multilingual Plane (BMP).
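
As a rough illustration of the size trade-offs described above, here is a minimal Python sketch that counts the encoded bytes of a few sample characters. The helper name encoded_sizes and the choice of sample characters are assumptions made for this example; they do not come from the quoted article.

```python
# Encoded size, in bytes, of a few sample characters under each UTF.
# The "-le" codec variants are used so that no byte order mark is counted.
# Samples: ASCII, Latin Extended-A, CJK (inside the BMP), emoji (outside the BMP).
samples = ["A", "\u0160", "\u5927", "\U0001F600"]


def encoded_sizes(ch):
    """Return (utf-8, utf-16, utf-32) byte counts for a single character."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(codec)) for codec in codecs)


for ch in samples:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}  utf-8: {utf8}  utf-16: {utf16}  utf-32: {utf32}")
```

Only the character outside the BMP needs 4 bytes in UTF-16, while UTF-8 grows from 1 to 4 bytes as the code point gets larger.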

Also see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Windows 7 – Czech Language Input Method and Font Support in gvim 7.4

The problem is that the Latin-2 (ISO-8859-2) and Windows-1250 (used by Windows) encodings differ in some characters:

ž, š, ť, Ž, Š, Ť

All the differences are summarized on Wikipedia (there is also a Czech version).

If you set encoding=cp1250, then it'll be ok.
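
The mismatch between the two encodings can also be checked outside Vim. The following is a small Python sketch, not part of the original answer, that prints the single-byte code of each of the six characters listed above in both encodings; byte_codes is a hypothetical helper name, while iso8859_2 and cp1250 are the codec names in Python's standard library.

```python
# Compare the single-byte codes of the six characters in both encodings.
# The characters are written as escapes: ž š ť Ž Š Ť
chars = "\u017e\u0161\u0165\u017d\u0160\u0164"


def byte_codes(ch):
    """Return (ISO-8859-2 byte, Windows-1250 byte) for one character."""
    return ch.encode("iso8859_2")[0], ch.encode("cp1250")[0]


for ch in chars:
    latin2, cp1250 = byte_codes(ch)
    print(f"U+{ord(ch):04X}  ISO-8859-2: 0x{latin2:02X}  Windows-1250: 0x{cp1250:02X}")

# Reading a Windows-1250 byte as ISO-8859-2 silently yields a different
# character (here a C1 control code instead of Š).
misread = "\u0160".encode("cp1250").decode("iso8859_2")
```

A file written in one of these encodings and read in the other therefore shows wrong characters exactly at these six letters, while the rest of the Czech alphabet looks fine.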

I don't want to prolong comments so I'm adding that here.

The problem is that a standard single-byte code page has only 256 (hex 100) code values, so there are separate ISO standards for different languages. If you have set encoding to iso-8859-2 and try to add the Unicode character Š (hex 160), gVim wraps around to the character at hex 60. You have to use the ISO-8859-2 code instead, where Š is hex A9. Other codes are listed here: http://cs.wikipedia.org/wiki/ISO_8859-2

UTF-8, on the other hand, uses a variable number of bytes (one to four) per character and covers essentially all letters and signs at once. So if you use set encoding=utf-8 and then add U+0160 or U+5927, you'll get Š and 大, respectively.
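
To make the arithmetic concrete, here is a short Python sketch. It assumes that the wrap-around mentioned above is a simple truncation to the low byte, which is a guess about gVim's behaviour rather than something the answer confirms, and all variable names are made up for this example.

```python
code_point = 0x0160  # the Unicode code point of Š

# Keeping only the low byte of 0x160 gives 0x60, the backtick character.
wrapped = chr(code_point & 0xFF)

# UTF-8 has no such limit: each code point becomes a sequence of 1 to 4 bytes.
utf8_s = chr(0x0160).encode("utf-8")   # Š
utf8_da = chr(0x5927).encode("utf-8")  # 大

print(repr(wrapped), utf8_s.hex(), utf8_da.hex())
```

Here Š takes two bytes in UTF-8 and 大 takes three, so neither has to be squeezed into a single-byte table.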

On my system Fixedsys contains ů and Ů. The alternative is that font versions differ between Windows language editions (I use the Czech version), but I doubt that. You can use the Windows utility Charmap.exe: select the font you want and check which characters it supports, along with their Unicode code points.

I briefly tried some of the default fonts in gVim, and some of them seem to support Chinese characters (e.g. MS Mincho), but I don't know which characters are important to you.

gVim seems to support only monospace fonts, so be aware of that if you go looking for another font. :)
