Most common encoding for strings in C++ in Linux (and Unix?)

Tags: c, character-encoding, conventions, string

For creating a C++ program that is source code level portable between Windows and Linux and handles internationalization well, there are IMHO three main encodings to consider:

  • The encoding of the C++ source code.
  • The encoding of external data.
  • The encoding(s) of strings and literals.

For the C++ source code there is not really any alternative to UTF-8 with BOM, at least if standard input and wide string literals are to work on the Windows platform. UTF-8 without BOM causes Microsoft's Visual C++ compiler to assume Windows ANSI encoding for the source code, which is nice for UTF-8 output via std::cout, to the limited degree that that works (the Windows console has many bugs in this area). However, input via std::cin then does not work.

And for external data, UTF-8 seems to be the de facto standard.

However, what about the internal literals and strings? Here I had the impression that narrow strings encoded as UTF-8 were the common convention in Linux. But recently two different people have claimed otherwise: one claiming that the common convention for internal strings in international applications in Linux is UTF-32, and the other just claiming that there is some unspecified difference between Unix and Linux in this area.

As one who fiddles a little, on a hobby basis, with a micro-library intended to abstract away the Windows/Linux differences in this area, I … have to ask concretely

  • what is the common Linux convention for representing strings in a program?

I am pretty sure that there is a common convention that is so overwhelmingly common that this question has a Real Answer™.

An example showing, e.g., how to reverse a string in the Linux-conventional way would also be nice. (Doing this directly on UTF-8 is complex, but presumably it is done by functions that are de facto standard in Linux?) That is, as a question: what is a Linux-conventional version of this C++ program? The code as given works when Latin-1 is the C++ narrow text execution character set:

#include <iostream>
#include <algorithm>
#include <string>
using namespace std;

#define STATIC_ASSERT( cond )   static_assert( cond, #cond )

int main()
{
    string line;
    if( getline( cin, line ) )
    {
        static char const aSingleChar[] = "æ";
        // Holds only for a single-byte narrow execution charset such as Latin-1.
        STATIC_ASSERT( sizeof( aSingleChar ) - 1 == 1 );
        reverse( line.begin(), line.end() );

        cout << line << endl;
    }
}

Best Answer

This is just a partial answer, since your question is fairly broad.

C++ defines an "execution character set" (in fact, two of them, a narrow and a wide one).

When your source file contains something like:

char s[] = "Hello";

Then the numeric byte values of the letters in the string literal are simply looked up according to the execution encoding. (The separate wide execution encoding applies to the numeric value assigned to wide character constants such as L'a'.)

All this happens as part of the initial reading of the source code file into the compilation process. Once inside, C++ characters are nothing more than bytes, with no attached semantics. (The type name char must be one of the most grievous misnomers in C-derived languages!)

There is a partial exception in C++11, where the literals u8"", u"" and U"" determine the resulting values of the string elements (i.e. the resulting values are globally unambiguous and platform-independent), but that does not affect how the input source code is interpreted.

A good compiler should allow you to specify the source code encoding, so even if your friend on an EBCDIC machine sends you her program text, that shouldn't be a problem. GCC offers the following options:

  • -finput-charset: input character set, i.e. how the source code file is encoded
  • -fexec-charset: execution character set, i.e. how to encode string literals
  • -fwide-exec-charset: wide execution character set, i.e. how to encode wide string literals

GCC uses iconv() for the conversions, so any encoding supported by iconv() can be used for those options.
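For example, invocations along these lines (a sketch; the encoding names must be ones your iconv installation recognizes, and main.cpp is a placeholder):

```shell
# Source file saved as Latin-1, narrow string literals stored as UTF-8 in the binary:
g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 main.cpp

# Source as UTF-8, wide string literals as little-endian UTF-32:
g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32LE main.cpp
```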

I wrote previously about some opaque facilities provided by the C++ standard to handle text encodings.


Example: take the above code, char s[] = "Hello";. Suppose the source file is ASCII (i.e. the input encoding is ASCII). Then the compiler reads 99 and interprets it as c, and so on. When it comes to the literal, it reads 72 and interprets it as H. Now it stores the byte value of H in the array, which is determined by the execution encoding (again 72 if that is ASCII or UTF-8). When you write \xFF, the compiler reads 92 120 70 70, decodes it as \xFF, and writes 255 into the array.